Abstract

We consider the problem of optimizing an approximately convex function over a bounded convex set in $\mathbb{R}^n$ using only function evaluations. The problem is reduced to sampling from an approximately log-concave distribution using the Hit-and-Run method, which is shown to have the same $O^*$ complexity as sampling from log-concave distributions. In addition to extending the analysis for log-concave distributions to approximately log-concave distributions, the implementation of the one-dimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm is then based on simulated annealing, which does not rely on first-order conditions and is therefore essentially immune to local minima.

We then apply the method to different motivating problems. In the context of zeroth-order stochastic convex optimization, the proposed method produces an $\epsilon$-minimizer after $O^*(n^{7.5}\epsilon^{-2})$ noisy function evaluations by inducing an $O(\epsilon/n)$-approximately log-concave distribution. We also consider in detail the case when the "amount of non-convexity" decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.


1 Introduction and Problem Formulation 

Let $\mathcal{K} \subset \mathbb{R}^n$ be a convex set, and let $F: \mathbb{R}^n \to \mathbb{R}$ be an approximately convex function over $\mathcal{K}$ in the sense that

\[ \sup_{x \in \mathcal{K}} |F(x) - f(x)| \le \epsilon/n \qquad (1) \]

for some convex function $f: \mathbb{R}^n \to \mathbb{R}$ and $\epsilon > 0$. In particular, $F$ may be discontinuous. We seek to find $x \in \mathcal{K}$ such that

\[ F(x) - \min_{y \in \mathcal{K}} F(y) \le \epsilon \qquad (2) \]

using only function evaluations of $F$. This paper presents a randomized method based on simulated annealing that satisfies (2) in expectation (or with high probability). Moreover, the number of required function evaluations of $F$ is at most $O^*(n^{4.5})$ (see Corollary 1), where $O^*$ hides polylogarithmic factors in $n$ and $\epsilon^{-1}$. Our method requires only a membership oracle for the set $\mathcal{K}$. In Section 7, we consider the case when the amount of non-convexity in (1) can be much larger than $\epsilon/n$ for points away from the optimum.

In the oracle model of computation, access to function values at queried points is referred to as the zeroth-order information. Exact function evaluation of $F$ may be equivalently viewed as approximate function evaluation of the convex function $f$, with the oracle returning a value

\[ F(x) \in [f(x) - \epsilon/n,\ f(x) + \epsilon/n]. \qquad (3) \]

A closely related problem is that of convex optimization with a stochastic zeroth-order oracle. Here, the oracle returns a noisy function value $f(x) + \eta$. If $\eta$ is zero-mean and sub-Gaussian, the function values can be averaged to emulate, with high probability, the approximate oracle (3). The randomized method we propose has an $O^*(n^{7.5}\epsilon^{-2})$ oracle complexity for convex stochastic zeroth-order optimization, which, to the best of our knowledge, is the best that is known for this problem. We refer to Section 6 for more details.

There are many motivations for studying zeroth-order optimization, and we refer the reader to Conn et al. (2009) for a discussion of problems where derivative-free methods are essential. In Section 8 we sketch three areas where the algorithm of this paper can be readily applied: private computation with distributed data, two-stage stochastic programming, and online learning algorithms.

2 Prior Work 

The present paper rests firmly on the long string of work by Kannan, Lovasz, Vempala, and others (Lovasz and Simonovits, 1993; Kannan et al., 1997; Kalai and Vempala, 2006; Lovasz and Vempala, 2006a,b, 2007). In particular, we invoke the key lower bound on conductance of Hit-and-Run from Lovasz and Vempala (2006a) and use the simulated annealing technique of Kalai and Vempala (2006). Our analysis extends Hit-and-Run to approximately log-concave distributions, which required new theoretical results and implementation adjustments. In particular, we propose a unidimensional sampling scheme that mixes fast to a truncated approximately log-concave distribution on the line.

Sampling from $\beta$-log-concave distributions was already studied in the early work of Applegate and Kannan (1991) with a discrete random walk based on a discretization of the space. In the case of non-smooth densities and unrestricted support, sampling from approximate log-concave distributions has also been studied in Belloni and Chernozhukov (2009), where the hidden convex function $f$ is quadratic. This additional structure was motivated by the central limit theorem in statistical applications and leads to faster mixing rates. Both works used ball walk-like strategies. Neither work considered random walks that allow for long steps like Hit-and-Run.

The present work was motivated by the question of information-based complexity of zeroth-order stochastic optimization. The paper of Agarwal et al. (2013) studies a somewhat harder problem of regret minimization with zeroth-order feedback. Their method is based on the pyramid construction of Nemirovskii and Yudin (1983) and requires $O(n^{33}\epsilon^{-2})$ noisy function evaluations to achieve a regret (and, hence, an optimization guarantee) of $\epsilon$. The method of Liang et al. (2014) improved the dependence on the dimension to $O^*(n^{14})$ using a Ball Walk on the epigraph of the function in the spirit of Bertsimas and Vempala (2004). The present paper further reduces this dependence to $O^*(n^{7.5})$ and still achieves the optimal $\epsilon^{-2}$ dependence on the accuracy. The best known lower bound for the problem is $\Omega(n^2\epsilon^{-2})$ (see Shamir (2012)).

Other relevant work includes the recent paper of Dyer et al. (2013), where the authors proposed a simple random walk method that requires only approximate function evaluations. As the authors mention, their algorithm only works for smooth functions and sets $\mathcal{K}$ with smooth boundaries, assumptions that we would like to avoid. Furthermore, the effective dependence of Dyer et al. (2013) on accuracy is worse than $\epsilon^{-2}$.

3 Preliminaries 

Throughout the paper, the functions $F$ and $f$ satisfy (1) and $f$ is convex. The Lipschitz constant of $f$ with respect to the $\ell_\infty$ norm will be denoted by $L$, defined as the smallest number such that $|f(x) - f(y)| \le L\|x - y\|_\infty$ for $x, y \in \mathcal{K}$. Assume the convex body $\mathcal{K} \subseteq \mathbb{R}^n$ to be well-rounded in the sense that there exist $r, R > 0$ such that $B_2^n(r) \subseteq \mathcal{K} \subseteq B_2^n(R)$ and $R/r \le O(\sqrt{n})$.^1 For a non-negative function $g$, denote by $\pi_g$ the normalized probability measure induced by $g$ and supported on $\mathcal{K}$.

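Definition 1. A function $h: \mathcal{K} \to \mathbb{R}_+$ is log-concave if

\[ h(\alpha x + (1-\alpha)y) \ge h(x)^{\alpha} h(y)^{1-\alpha} \]

for all $x, y \in \mathcal{K}$ and $\alpha \in [0,1]$. A function is called $\beta$-log-concave for some $\beta \ge 0$ if

\[ h(\alpha x + (1-\alpha)y) \ge e^{-\beta} h(x)^{\alpha} h(y)^{1-\alpha} \]

for all $x, y \in \mathcal{K}$ and $\alpha \in [0,1]$.

Definition 2. A function $g: \mathcal{K} \to \mathbb{R}_+$ is $\xi$-approximately log-concave if there is a log-concave function $h: \mathcal{K} \to \mathbb{R}_+$ such that

\[ \sup_{x \in \mathcal{K}} |\log h(x) - \log g(x)| \le \xi. \]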
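To make Definitions 1 and 2 concrete, the following minimal sketch checks the $\beta$-log-concavity inequality on a finite one-dimensional grid. The function names, the test function $F(x) = x^2 + 0.1\cos(50x)$, and the grid are illustrative assumptions, not part of the paper. Passing the check on a grid is only a necessary condition, not a proof.

```python
import math

def is_beta_log_concave(g, xs, beta, alphas=(0.25, 0.5, 0.75), tol=1e-9):
    """Check g(a*x + (1-a)*y) >= exp(-beta) * g(x)**a * g(y)**(1-a) on a finite grid."""
    for x in xs:
        for y in xs:
            for a in alphas:
                z = a * x + (1 - a) * y
                lhs = math.log(g(z))
                rhs = -beta + a * math.log(g(x)) + (1 - a) * math.log(g(y))
                if lhs < rhs - tol:
                    return False
    return True

def g_example(x):
    # Hypothetical example: F(x) = x^2 + 0.1*cos(50x), i.e. the convex f(x) = x^2
    # plus a perturbation bounded by 0.1, so g = exp(-F) is 0.1-approximately
    # log-concave with h = exp(-x^2).
    return math.exp(-(x * x + 0.1 * math.cos(50.0 * x)))

grid = [i / 20.0 for i in range(-20, 21)]
```

Under these assumptions, `g_example` passes the check with $\beta = 0.2$ (twice the perturbation size) but fails it with $\beta = 0$, since the oscillation breaks exact log-concavity.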

Lemma 1. If the function $g$ is $\beta/2$-approximately log-concave, then $g$ is $\beta$-log-concave.

For one-dimensional functions, the above lemma can be reversed:

Lemma 2 (Belloni and Chernozhukov (2009), Lemma 9). If $g$ is a unidimensional $\beta$-log-concave function, then there exists a log-concave function $h$ such that

\[ e^{-\beta} h(x) \le g(x) \le h(x) \quad \text{for all } x \in \mathbb{R}. \]

Remark 1 (Gap Between $\beta$-Log-Concave Functions and $\xi$-Approximate Log-Concave Functions). A consequence of Lemma 2 is that $\beta$-log-concave functions are equivalent to $\beta$-approximately log-concave functions when the domain is unidimensional. However, such equivalence no longer holds in higher dimensions. In the case the domain is $\mathbb{R}^n$, Green et al. (1952); Cholewa (1984) established that $\beta$-log-concave functions are $\frac{\beta}{2}\log_2(2n)$-approximately log-concave. Laczkovich (1999) showed that there are functions such that the factor that relates these approximations cannot be less than $\frac{1}{4}\log_2(n/2)$. □

We end this section with two useful lemmas that can be found in Lovasz and Vempala (2007). 

Lemma 3 (Lovasz and Vempala (2007), Lemma 5.19). Let $h: \mathbb{R}^n \to \mathbb{R}_+$ be a log-concave function. Define $M_h := \max h$ and $L_h(t) = \{x \in \mathbb{R}^n : h(x) \ge t\}$. Then for $0 < s < t < M_h$,

\[ \frac{\mathrm{vol}(L_h(s))}{\mathrm{vol}(L_h(t))} \le \left( \frac{\log(M_h/s)}{\log(M_h/t)} \right)^n. \]

Lemma 4 (Lovasz and Vempala (2007), Lemma 5.6(a)). Let $X$ be a random point drawn from a log-concave distribution $h: \mathbb{R} \to \mathbb{R}_+$ and let $M_h := \max_{x \in \mathbb{R}} h(x)$. Then for every $t \ge 0$,

\[ P(h(X) \le t) \le \frac{t}{M_h}. \]

1. The well-roundedness condition on $\mathcal{K}$ can be relaxed by applying a pencil construction as in Lovasz and Vempala (2007).

4 Sampling from Approximate Log-Concave Distributions via Hit-and-Run 

In this section we analyze the Hit-and-Run procedure to simulate random variables from a distribution 
induced by an approximate log-concave function. The Hit-and-Run algorithm is as follows. 


Algorithm 1: Hit-and-Run

Input: a target distribution $\pi_g$ on $\mathcal{K}$ induced by a nonnegative function $g$; $x \in \mathrm{dom}(g)$; linear transformation $Z$; number of steps $m$
Output: a point $x' \in \mathrm{dom}(g)$ generated by the Hit-and-Run walk
initialization: a starting point $x \in \mathrm{dom}(g)$;

for $i = 1, \ldots, m$ do

1. Choose a random line $\ell$ that passes through $x$. The direction is uniform on the surface of the ellipsoid given by $Z$ acting on the sphere;

2. On the line $\ell$, run the unidimensional rejection sampler with $\pi_g$ restricted to the line (and supported on $\mathcal{K}$) to propose a successful next step $x'$;

end
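The walk in Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it takes $Z = I$, finds the chord endpoints by bisection on a membership oracle `inside`, and replaces the unidimensional rejection sampler with a simple discretized sampler on the chord. The names `hit_and_run`, `chord_end`, `t_max`, and `grid` are hypothetical.

```python
import math
import random

def random_direction(n, rng):
    """Uniform direction on the unit sphere (the case Z = I)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def chord_end(x, d, inside, t_max, iters=50):
    """Largest t in [0, t_max] with x + t*d in the convex set, via bisection on the oracle."""
    lo, hi = 0.0, t_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if inside([xi + mid * di for xi, di in zip(x, d)]):
            lo = mid
        else:
            hi = mid
    return lo

def hit_and_run(g, inside, x, m, t_max, rng, grid=200):
    """Run m Hit-and-Run steps targeting the density proportional to g on the set."""
    for _ in range(m):
        d = random_direction(len(x), rng)
        t_hi = chord_end(x, d, inside, t_max)
        t_lo = -chord_end(x, [-c for c in d], inside, t_max)
        # Stand-in line sampler (an assumption): discretize the chord and
        # sample a grid point with probability proportional to g.
        ts = [t_lo + (t_hi - t_lo) * (k + 0.5) / grid for k in range(grid)]
        ws = [g([xi + t * di for xi, di in zip(x, d)]) for t in ts]
        t = rng.choices(ts, weights=ws, k=1)[0]
        x = [xi + t * di for xi, di in zip(x, d)]
    return x
```

Here `t_max` should be at least the diameter of the set so that the bisection brackets the boundary. The discretized line sampler costs `grid` evaluations per step, whereas the sampler analyzed in Section 4.1 needs only a logarithmic number.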


In order to handle approximate log-concave functions we need to address both implementation issues and the theoretical difficulties caused by deviations from log-concavity, which can include discontinuities. The main implementation difference lies in the unidimensional sampler. A binary search no longer yields the maximum over the line and its end points, since $\beta$-log-concave functions can be discontinuous and multimodal. We now turn to these questions.

4.1 Unidimensional sampling scheme 

As a building block of the randomized method for solving the optimization problem (2), we introduce a one-dimensional sampling procedure. Let $g$ be a unidimensional $\beta$-log-concave function on a bounded line segment $\ell$, and let $\pi_{g,\ell}$ be the induced normalized measure. The following guarantee will be proved in this section.

Lemma 5. Let $g$ be a $\beta$-log-concave function and let $\ell$ be a bounded line segment on $\mathcal{K}$. Given a target accuracy $\epsilon \in (0, e^{-2\beta}/2)$, Algorithm 2 produces a point $X \in \ell$ with a distribution $\hat\pi_{g,\ell}$ such that

\[ d_{tv}(\hat\pi_{g,\ell}, \pi_{g,\ell}) \le 3e^{2\beta}\epsilon. \]

Moreover, the method requires $O^*(1)$ evaluations of the unidimensional $\beta$-log-concave function $g$ if $\beta$ is $O(1)$.
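Algorithm 2: Unidimensional rejection sampler

Input: unidimensional $\beta$-log-concave function $g$ defined on a bounded segment $\ell = [\underline{x}, \bar{x}]$; accuracy $\epsilon > 0$
Output: a sample $x$ with distribution $\hat\pi_{g,\ell}$ close to $\pi_{g,\ell}$

Initialization: (a) compute a point $p \in \ell$ s.t. $g(p) \ge e^{-3\beta} \max_{z \in \ell} g(z)$;

(b) given target accuracy $\epsilon$, find two points $e_{-1}, e_1$ on two sides of $p$ s.t.

\[ e_{-1} = \underline{x} \ \text{if } g(\underline{x}) \ge \tfrac{1}{2}e^{-\beta}\epsilon g(p), \qquad \tfrac{1}{2}e^{-\beta}\epsilon g(p) \le g(e_{-1}) \le \epsilon g(p) \ \text{otherwise}, \]
\[ e_{1} = \bar{x} \ \text{if } g(\bar{x}) \ge \tfrac{1}{2}e^{-\beta}\epsilon g(p), \qquad \tfrac{1}{2}e^{-\beta}\epsilon g(p) \le g(e_{1}) \le \epsilon g(p) \ \text{otherwise}; \qquad (4) \]

while sample rejected do

    pick $x \sim \mathrm{unif}([e_{-1}, e_1])$ and pick $r \sim \mathrm{unif}([0,1])$ independently;
    if $r \le g(x)/\{g(p)e^{3\beta}\}$ then
        accept $x$ and stop;
    else
        reject $x$;
    end

end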
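The accept/reject loop of Algorithm 2 can be sketched as follows. This is a minimal illustration that assumes the two initialization steps have already produced a point `p` with $g(p) \ge e^{-3\beta}\max g$ and the endpoints `e_lo`, `e_hi` satisfying (4). The function name and the `max_tries` safeguard are hypothetical additions.

```python
import math
import random

def rejection_sample(g, p, e_lo, e_hi, beta, rng, max_tries=100000):
    """Draw x ~ unif([e_lo, e_hi]) and accept with probability g(x) / (g(p) * exp(3*beta))."""
    cap = g(p) * math.exp(3.0 * beta)  # upper bound on g when g(p) >= exp(-3*beta) * max g
    for _ in range(max_tries):
        x = rng.uniform(e_lo, e_hi)
        r = rng.random()
        if r <= g(x) / cap:
            return x
    raise RuntimeError("no sample accepted; check p and the endpoints")
```

Because `cap` dominates $g$ on the segment, accepted points follow the density proportional to $g$ restricted to $[e_{-1}, e_1]$. The only error relative to $\pi_{g,\ell}$ is the mass truncated outside that interval.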


The proposed method for sampling from the $\beta$-log-concave function $g$ is a rejection sampler that requires two initialization steps. We first show how to implement step (a).


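Algorithm 3: Initialization Step (a)

Input: unidimensional $\beta$-log-concave function $g$ defined on a bounded interval $\ell = [\underline{x}, \bar{x}]$
Output: a point $p \in \ell$ s.t. $g(p) \ge e^{-3\beta} \max_{z \in \ell} g(z)$
while did not stop do

    set $x_l = \tfrac{3}{4}\underline{x} + \tfrac{1}{4}\bar{x}$, $x_c = \tfrac{1}{2}\underline{x} + \tfrac{1}{2}\bar{x}$ and $x_r = \tfrac{1}{4}\underline{x} + \tfrac{3}{4}\bar{x}$;

    if $|\log g(x_l) - \log g(x_r)| > \beta$ then
        set $[\underline{x}, \bar{x}]$ as either $[x_l, \bar{x}]$ or $[\underline{x}, x_r]$ accordingly;
    else
        if $|\log g(x_l) - \log g(x_c)| > \beta$ then
            set $[\underline{x}, \bar{x}]$ as either $[x_l, \bar{x}]$ or $[\underline{x}, x_c]$ accordingly;
        else if $|\log g(x_r) - \log g(x_c)| > \beta$ then
            set $[\underline{x}, \bar{x}]$ as either $[\underline{x}, x_r]$ or $[x_c, \bar{x}]$ accordingly;
        else
            output $p = \arg\max_{x \in \{x_l, x_c, x_r\}} g(x)$ and stop;
        end
    end

end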
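A minimal Python sketch of Algorithm 3 follows. The pseudocode shrinks the interval "accordingly" without spelling out the rule. The sketch assumes this means discarding the end of the interval beyond the probe point with the smaller value of $g$. The function name and the `max_iter` cap are hypothetical.

```python
import math

def initial_point(g, lo, hi, beta, max_iter=200):
    """Shrink [lo, hi] until the three probe values of log g agree within beta."""
    for _ in range(max_iter):
        xl = 0.75 * lo + 0.25 * hi
        xc = 0.5 * lo + 0.5 * hi
        xr = 0.25 * lo + 0.75 * hi
        gl, gc, gr = math.log(g(xl)), math.log(g(xc)), math.log(g(xr))
        if abs(gl - gr) > beta:
            if gl > gr:
                hi = xr   # right end is much lower: keep [lo, xr]
            else:
                lo = xl   # left end is much lower: keep [xl, hi]
        elif abs(gl - gc) > beta:
            if gl > gc:
                hi = xc
            else:
                lo = xl
        elif abs(gr - gc) > beta:
            if gr > gc:
                lo = xc
            else:
                hi = xr
        else:
            break
    # Return the best of the three probe points of the final interval.
    return max((xl, xc, xr), key=g)
```

Each pass uses three evaluations and removes at least a quarter of the interval, which is in line with the logarithmic query count stated in Lemma 6.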


For the $\beta$-log-concave function $g$, let $h$ be a log-concave function as in Lemma 2 and let $L$ denote the Lipschitz constant of the convex function $-\log h$. In the following two results, the $O^*$ notation hides a $\log(L)$ factor.

Lemma 6 (Initialization Step (a)). Algorithm 3 finds a point $p \in \ell$ that satisfies $g(p) \ge e^{-3\beta} \max_{z \in \ell} g(z)$. Moreover, this step requires $O^*(1)$ function evaluations.

Lemma 7 (Initialization Step (b)). Let $\ell = [\underline{x}, \bar{x}]$ and $p \in \ell$. The binary search algorithm finds $e_{-1} \in [\underline{x}, p]$ and $e_1 \in [p, \bar{x}]$ such that (4) holds. Moreover, this step requires $O^*(1)$ function evaluations.

According to Lemmas 5, 6, 7, the unidimensional sampling method produces a sample from a distribution that is close to the desired $\beta$-log-concave distribution. Furthermore, the method requires a number of queries that is logarithmic in all the parameters.


4.2 Mixing time 

In this section, we will analyze the mixing time of the Hit-and-Run algorithm with a $\beta/2$-approximate log-concave function $g$, namely

\[ \exists\ \text{log-concave } h \ \text{s.t.} \ \sup_{\mathcal{K}} |\log g - \log h| \le \beta/2. \qquad (5) \]

In particular, this implies that $g$ is $\beta$-log-concave, according to Lemma 1. In this section, we provide the analysis of Hit-and-Run with the linear transformation $Z = I$ and remark that the results extend to other linear transformations employed to round the log-concave distributions.

The mixing time of a geometric random walk can be bounded through the spectral gap of the induced Markov chain. In turn, the spectral gap relates to the so-called conductance, which has been a key quantity in the literature. Consider the transition probability of Hit-and-Run with a density $g$, namely

\[ P_u^g(A) = \frac{2}{n\pi_n} \int_A \frac{g(x)\,dx}{\mu_g(u,x)\,|x-u|^{n-1}}, \]

where $\mu_g(u,x) = \int_{\ell(u,x) \cap \mathcal{K}} g(y)\,dy$. Let $\pi_g(x) = g(x)/\int_{\mathcal{K}} g(y)\,dy$ be the probability measure induced by the function $g$. The conductance for a set $S \subset \mathcal{K}$ with $0 < \pi_g(S) < 1$ is defined as

\[ \phi^g(S) = \frac{\int_{x \in S} P_x^g(\mathcal{K} \setminus S)\,d\pi_g}{\min\{\pi_g(S), \pi_g(\mathcal{K} \setminus S)\}}, \]

and $\phi^g$ is the minimum conductance over all measurable sets. The $s$-conductance is, in turn, defined as

\[ \phi_s^g = \inf_{S \subset \mathcal{K},\ s < \pi_g(S) \le 1/2} \frac{\int_{x \in S} P_x^g(\mathcal{K} \setminus S)\,d\pi_g}{\pi_g(S) - s}. \]

By definition we have $\phi^g \le \phi_s^g$ for all $s > 0$.

The following theorem provides an upper bound on the mixing time of the Markov chain based on conductance. Let $\sigma^{(0)}$ be the initial distribution and $\sigma^{(m)}$ the distribution of the $m$-th step of the random walk of Hit-and-Run with exact sampling from the distribution $\pi_g$ restricted to the line.


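Theorem 1 (Lovasz and Simonovits (1993); Lovasz and Vempala (2007), Lemma 9.1). Let $0 < s \le 1/2$ and let $g: \mathcal{K} \to \mathbb{R}_+$ be arbitrary. Then for every $m \ge 0$,

\[ d_{tv}(\pi_g, \sigma^{(m)}) \le H_0 \left(1 - \frac{(\phi^g)^2}{2}\right)^m \quad \text{and} \quad d_{tv}(\pi_g, \sigma^{(m)}) \le H_s + \frac{H_s}{s}\left(1 - \frac{(\phi_s^g)^2}{2}\right)^m, \]

where

\[ H_0 = \sup_{x \in \mathcal{K}} \pi_g(x)/\sigma^{(0)}(x) \quad \text{and} \quad H_s = \sup\{|\pi_g(A) - \sigma^{(0)}(A)| : \pi_g(A) \le s\}. \]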
Building on Lovasz and Vempala (2006a), we prove the following result that provides us with a lower bound on the conductance for Hit-and-Run induced by a log-concave $h$. The proof of the result below follows the proof of Theorem 3.7 in Lovasz and Vempala (2006a) with modifications to allow unbounded sets without truncating the random walk.

Theorem 2 (Conductance Lower Bound for Log-concave Measures with Unbounded Support). Let $h$ be a log-concave function in $\mathbb{R}^n$ such that the level set of measure $\frac{1}{8}$ contains a ball of radius $r$. Define $R = (E_h\|X - z_h\|^2)^{1/2}$, where $z_h = E_h X$ and $X$ is sampled from the log-concave measure induced by $h$. Then for any subset $S$ with $\pi_h(S) = p \le \frac{1}{2}$, the conductance of Hit-and-Run satisfies

\[ \phi^h(S) \ge \frac{1}{C_1\, n\, \frac{R}{r} \log^2\!\left(\frac{nR}{rp}\right)}, \]

where $C_1 > 0$ is a universal constant.


Although Theorem 2 is new, very similar conductance bounds allowing for unbounded sets were established before. Indeed, in Section 3.3 of Lovasz and Vempala (2006a) the authors discuss the case of unbounded $\mathcal{K}$ and propose to truncate the set to its effective diameter and use the fact that this distribution would be close to the distribution of the unrestricted set. Such truncation needs to be enforced, which requires changing the implementation of the algorithm and leads to another (small) layer of approximation errors. Theorem 2 avoids this explicit truncation; truncation is done implicitly in the proof only. We note that when applying the simulated annealing technique, even if we start with a bounded set, by diminishing the temperature we are effectively stretching the sets, which would essentially require handling unbounded sets.

We now argue that the conductance of Hit-and-Run with $\beta$-approximate log-concave measures can be related to the conductance with log-concave measures.


Theorem 3 (Conductance Lower Bound for Approximate Log-concave Measures). Let $g$ be a $\beta/2$-approximate log-concave measure and $h$ be any log-concave function with the property (5). Then the conductance and $s$-conductance of the random walk induced by $g$ are lower bounded as

> e~‘^^(p^ and cpf > 


We apply Theorem 1 to show contraction of $\sigma^{(m)}$ to $\pi_g$ in terms of the total variation distance.

Theorem 4 (Mixing Time for Approximately-log-concave Measure). Let $\pi_g$ be the stationary measure associated with the Hit-and-Run walk based on a $\beta/2$-approximately log-concave function $g$, and $M = \|\sigma^{(0)}/\pi_g\| = \int (d\sigma^{(0)}/d\pi_g)\,d\sigma^{(0)}$. There is a universal constant $C < \infty$ such that for any $\gamma \in (0, 1/2)$, if

\[ m \ge C n^2 \frac{e^{6\beta} R^2}{r^2} \log^4\!\left(\frac{e^{\beta} M n R}{r\gamma^2}\right) \log\!\left(\frac{M}{\gamma}\right), \]

then $m$ steps of the Hit-and-Run random walk based on $g$ yield

\[ d_{tv}(\sigma^{(m)}, \pi_g) \le \gamma. \]

Remark 2. The value $M$ in Theorem 4 bounds the impact of the initial distribution $\sigma^{(0)}$, which can be potentially far from the stationary distribution. In the simulated annealing application of the next section, we will show in Lemma 8 that we can "warm start" the chain by carefully picking an initial distribution such that $M = O(1)$.
Theorem 4 shows $\gamma$-closeness between the distribution $\sigma^{(m)}$ and the corresponding stationary distribution. However, the stationary distribution is not exactly $\pi_g$ since the unidimensional sampling procedure described earlier truncates the distribution to improve mixing time. The following theorem shows that these concerns are overtaken by the geometric mixing of the random walk. Let $\hat\pi_{g,\ell}$ denote the distribution of the unidimensional sampling scheme (Algorithm 2) along the line $\ell$ and $\pi_{g,\ell}$ denote the distribution proportional to $g$ along the line $\ell$.

Theorem 5. Let $\hat\sigma^{(m)}$ denote the distribution of the Hit-and-Run with the unidimensional sampling scheme (Algorithm 2) after $m$ steps. For any $0 < s < 1/2$, the algorithm maintains that

\[ d_{tv}(\hat\sigma^{(m)}, \pi_g) \le 2 d_{tv}(\sigma^{(0)}, \hat\sigma^{(0)}) + m \sup_{\ell \subset \mathcal{K}} d_{tv}(\hat\pi_{g,\ell}, \pi_{g,\ell}) + H_s + \frac{H_s}{s}\left(1 - \frac{(\phi_s^g)^2}{2}\right)^m, \]

where the supremum is taken over all lines $\ell$ in $\mathcal{K}$. In particular, for a target accuracy $\gamma \in (0, 1/e)$, if $d_{tv}(\sigma^{(0)}, \hat\sigma^{(0)}) \le \gamma/8$, $s$ is such that $H_s \le \gamma/4$, $m \ge \{2/(\phi_s^g)^2\}\log(\{H_s/s\}\{4/\gamma\})$, and the precision of the unidimensional sampling scheme is $\epsilon = \gamma e^{-2\beta}/\{12m\}$, we have

\[ d_{tv}(\hat\sigma^{(m)}, \pi_g) \le \gamma. \]


5 Optimization via Simulated Annealing 

We now turn to the main goal of the paper: to exhibit a method that produces an $\epsilon$-minimizer of the nearly convex function $F$ in expectation. Fix the pair $f, F$ with the property (1), and define a series of functions

\[ h_i(x) = \exp(-f(x)/T_i), \qquad g_i(x) = \exp(-F(x)/T_i) \]

for a chain of temperatures $\{T_i, i = 1, \ldots, K\}$ to be specified later. It is immediate that the $h_i$'s are log-concave. Since $|\log g_i - \log h_i| \le \epsilon/(nT_i)$ by (1), Lemma 1, in turn, implies that the $g_i$'s are $\frac{2\epsilon}{nT_i}$-log-concave.

We now introduce the simulated annealing method that proceeds in epochs and employs the Hit-and-Run procedure with the unidimensional sampler introduced in the previous section. The overall simulated annealing procedure is identical to the algorithm of Kalai and Vempala (2006), with differences in the analysis arising from $F$ being only approximately convex.
Algorithm 4: Simulated annealing

Input: A series of temperatures $\{T_i, 1 \le i \le K\}$, $K$ = number of epochs, $x \in \mathrm{int}\,\mathcal{K}$
Output: a candidate point $x$ for which $F(x) \le \min_{y \in \mathcal{K}} F(y) + \epsilon$ holds

initialization: well-rounded convex body $\mathcal{K}$ and $\{X_0^j, 1 \le j \le N\}$ i.i.d. samples from the uniform measure on $\mathcal{K}$, $N$ = number of strands, and set $Z_0 = I$;

while $i$-th epoch, $1 \le i \le K$ do

1. calculate the $i$-th rounding linear transformation $\mathcal{T}_i$ based on $\{X_{i-1}^j, 1 \le j \le N\}$ and let $Z_i = \mathcal{T}_i \circ Z_{i-1}$;

2. draw $N$ i.i.d. samples $\{X_i^j, 1 \le j \le N\}$ from the measure $\pi_{g_i}$ using the Hit-and-Run algorithm with linear transformation $Z_i$ and with $N$ warm-starting points $\{X_{i-1}^j, 1 \le j \le N\}$;

end

output $x = \arg\min_{1 \le j \le N,\ 1 \le i \le K} F(X_i^j)$.


Before stating the optimization guarantee of the above simulated annealing procedure, we prove the warm-start property of the distributions between successive epochs and the rounding guarantee given by $N$ samples.

5.1 Warm start and mixing 

We need to prove that the measures between successive temperatures are not too far away in the $\ell_2$ sense, so that the samples from the previous epoch can be treated as a warm start for the next epoch. The following result is an extension of Lemma 6.1 in (Kalai and Vempala, 2006) to $\beta$-log-concave functions.

Lemma 8. Let $g(x) = \exp(-F(x))$ be a $\beta$-log-concave function. Let $\mu_i$ be a distribution with density proportional to $\exp\{-F(x)/T_i\}$, supported on $\mathcal{K}$. Let $T_i = T_{i-1}(1 - 1/\sqrt{n})$. Then

\[ \|\mu_i/\mu_{i+1}\| \le C_\gamma = 5\exp(2\beta/T_i). \]

Next we account for the impact of using the final distribution from the previous epoch $\hat\sigma_{i-1}^{(m)}$ as a "warm-start."

Theorem 6. Fix a target accuracy $\gamma \in (0, 1/e)$ and let $g$ be a $\beta/2$-approximately log-concave function in $\mathbb{R}^n$. Suppose the simulated annealing algorithm (Algorithm 4) is run for $K = \sqrt{n}\log(1/\rho)$ epochs with temperature parameters $T_i = (1 - 1/\sqrt{n})^i$, $0 \le i \le K$. If the Hit-and-Run with the unidimensional sampling scheme (Algorithm 2) is run for the $m = O^*(n^3)$ steps prescribed in Theorem 4, the algorithm maintains that

\[ d_{tv}(\hat\sigma_i^{(m)}, \pi_{g_i}) \le e\gamma \qquad (6) \]

at every epoch $i$, where $\hat\sigma_i^{(m)}$ is the distribution of the $m$-th step of Hit-and-Run. Here, $m$ depends polylogarithmically on $\rho^{-1}$.
5.2 Rounding for $\beta$-log-concave functions

The simulated annealing procedure runs $N = O^*(n)$ strands of random walk to round the log-concave distribution into near-isotropic position (say, 1/2-near-isotropic) at each temperature. The $N$ strands do not interact and thus the computation within each epoch can be parallelized, further reducing the time complexity of the algorithm. For $N$ i.i.d. isotropic random vectors $X_i \in \mathbb{R}^n$, $1 \le i \le N$, sampled from a log-concave measure, the following concentration holds when $N$ is large enough:

\[ \sup_{\|\theta\|_2 = 1} \left| \theta^T \left( \frac{1}{N}\sum_{i=1}^N X_i X_i^T - I \right) \theta \right| \le \frac{1}{2}, \]

or, equivalently,

\[ \frac{1}{2} \le \sigma_{\min}\!\left(\frac{1}{N}\sum_{i=1}^N X_i X_i^T\right) \le \sigma_{\max}\!\left(\frac{1}{N}\sum_{i=1}^N X_i X_i^T\right) \le \frac{3}{2}. \qquad (7) \]


Theorems of this type were first obtained for uniform measures on the convex body $\mathcal{K}$ (measures with bounded Orlicz $\psi_2$ norm). Bourgain (1996) proved this holds as long as $N \ge Cn\log^3 n$. Rudelson (1999) improved this bound to $N \ge Cn\log^2 n$. For log-concave measures (with bounded Orlicz $\psi_1$ norm through Borell's lemma), Guedon and Rudelson (2007) proved a stronger version where $N \ge Cn\log n$. See also (Adamczak et al., 2010) for further improvements. For bounded (almost surely) vectors, we can instead appeal to the following literature. Theorem 5.41 in (Vershynin, 2010) yields a spectral concentration bound for heavy-tailed random matrices with isotropic independent rows. (See also Tropp (2012), Theorem 4.1, for matrix Bernstein-type inequalities.)

For our problem, we need to prove (7) for independent near-isotropic rows with $\beta$-log-concave measures. There are two ways to achieve this goal. The first is to invoke the result of Guedon and Rudelson (2007). Random vectors sampled from $\beta$-log-concave measures still belong to the Orlicz $\psi_1$ family, thus $N \ge O(n\log n)$ is enough to achieve the goal with high probability. The second way is through the following lemma:


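Lemma 9 (Vershynin (2010), Theorem 5.41). Let $X$ be an $N \times n$ matrix whose rows $X_i$ are independent isotropic random vectors in $\mathbb{R}^n$. Let $R$ be a number such that $\|X_i\|_2 \le R$ almost surely for all $i$. Then for every $t \ge 0$, one has

\[ \sqrt{N} - tR \le \sigma_{\min}(X) \le \sigma_{\max}(X) \le \sqrt{N} + tR \]

with probability at least $1 - 2n\exp(-ct^2)$, where $c > 0$ is an absolute constant.

Clearly if we take $N \ge CR^2\log n$, we have

\[ \frac{1}{2} \le \sigma_{\min}\!\left(\frac{1}{N}\sum_{i=1}^N X_i X_i^T\right) \le \sigma_{\max}\!\left(\frac{1}{N}\sum_{i=1}^N X_i X_i^T\right) \le \frac{3}{2} \]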

with probability at least 1 - n~^, since XL is uniformly bounded within R = 0{\/n)r and isotropic con¬ 
dition implies r = 0{l) (which translates into llXdl^ < 0{\/n)]. Thus we conclude that N = Qinlogn] is 
enough for bringing a /1-log concave measure into isotropic position. 


5.3 Optimization guarantee 

We prove an extension of Lemma 4.1 in (Kalai and Vempala, 2006) : 
Theorem 7. Let $f$ be a convex function. Let $X$ be chosen according to a distribution with density proportional to $\exp\{-f(x)/T\}$. Then

\[ E_f f(X) - \min_{x \in \mathcal{K}} f(x) \le (n+1)T. \]

Furthermore, if $F$ is such that $|F - f|_\infty \le \rho$, for $X$ chosen from a distribution with density proportional to $\exp\{-F(x)/T\}$, we have

\[ E_F f(X) - \min_{x \in \mathcal{K}} f(x) \le (n+1)T \cdot \exp(2\rho/T). \]
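As a sanity check of Theorem 7 in one dimension ($n = 1$), the following sketch estimates $E_F f(X) - \min f$ by a midpoint rule on an interval. The function name, the grid size, and the example functions are illustrative assumptions, not part of the paper.

```python
import math

def expected_gap_1d(f, F, lo, hi, T, grid=20000):
    """Midpoint-rule estimate of E f(X) - min f, with X of density prop. to exp(-F/T) on [lo, hi]."""
    h = (hi - lo) / grid
    xs = [lo + (k + 0.5) * h for k in range(grid)]
    fs = [f(x) for x in xs]
    Fs = [F(x) for x in xs]
    f_min, F_min = min(fs), min(Fs)
    # Subtract F_min before exponentiating for numerical stability.
    ws = [math.exp(-(v - F_min) / T) for v in Fs]
    return sum(w * (v - f_min) for w, v in zip(ws, fs)) / sum(ws)
```

For example, with $f(x) = |x|$ on $[-1, 1]$ and $T = 0.1$, the estimate with $F = f$ should stay below $(n+1)T = 0.2$. With the hypothetical perturbation $F(x) = |x| + 0.01\cos(50x)$, so $\rho = 0.01$, it should stay below $0.2\,e^{0.2}$.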

The above theorem implies that the final temperature $T_K$ in the simulated annealing procedure needs to be set as $T_K = \epsilon/n$. This, in turn, leads to $K = \sqrt{n}\log(n/\epsilon)$ epochs. The oracle complexity of optimizing $F$ is then, informally,

\[ O^*(n^3)\ \text{queries per sample} \times O^*(n)\ \text{parallel strands} \times O^*(\sqrt{n})\ \text{epochs} = O^*(n^{4.5}). \]
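The temperature schedule and the informal query count above can be reproduced with a short sketch. The function names are hypothetical, the schedule follows $T_i = (1 - 1/\sqrt{n})^i$ from Theorem 6, and all polylogarithmic factors are ignored in the count.

```python
import math

def annealing_schedule(n, eps):
    """Temperatures T_i = (1 - 1/sqrt(n))**i for i = 0..K, with K = ceil(sqrt(n) * log(n/eps))."""
    K = math.ceil(math.sqrt(n) * math.log(n / eps))
    return [(1.0 - 1.0 / math.sqrt(n)) ** i for i in range(K + 1)]

def query_count_order(n):
    """n^3 steps per sample x n strands x sqrt(n) epochs, ignoring polylog factors."""
    return n ** 3 * n * math.sqrt(n)
```

Since $(1 - 1/\sqrt{n})^{\sqrt{n}\log(n/\epsilon)} \le e^{-\log(n/\epsilon)} = \epsilon/n$, the last temperature of the schedule is at most the required $T_K = \epsilon/n$.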

The following corollary summarizes the computational complexity result: 

Corollary 1. Suppose $F$ is approximately convex and $|F - f| \le \epsilon/n$ as in (1). The simulated annealing method with $K = \sqrt{n}\log(n/\epsilon)$ epochs produces a random point $X$ such that

\[ E f(X) - \min_{x \in \mathcal{K}} f(x) \le \epsilon, \]

and thus

\[ E F(X) - \min_{x \in \mathcal{K}} F(x) \le 2\epsilon. \]

Furthermore, the number of oracle queries required by the method is $O^*(n^{4.5})$.

6 Stochastic Convex Zeroth Order Optimization 

Let $f: \mathcal{K} \to \mathbb{R}$ be the unknown convex $L$-Lipschitz function we aim to minimize. Within the model of convex optimization with stochastic zeroth-order oracle $\mathcal{O}$, the information returned upon a query $x \in \mathcal{K}$ is $f(x) + \epsilon_x$, where $\epsilon_x$ is the zero-mean noise. We shall assume that the noise is sub-Gaussian with parameter $\sigma$. That is,

\[ E\exp(\lambda\epsilon_x) \le \exp(\sigma^2\lambda^2/2). \]

It is easy to see from Chernoff's bound that for any $t \ge 0$

\[ P(|\epsilon_x| \ge \sigma t) \le 2\exp(-t^2/2). \]

We can decrease the noise level by repeatedly querying at $x$. Fix $\tau > 0$, to be determined later. The average $\bar\epsilon_x$ of $\tau$ observations is concentrated as

\[ P(|\bar\epsilon_x| \ge \sigma t/\sqrt{\tau}) \le 2\exp(-t^2/2). \]

To use the randomized optimization method developed in this paper, we view $f(x) + \bar\epsilon_x$ as the value of $F(x)$ returned upon a single query at $x$. Since the randomized method does not revisit $x$ with probability 1, the function $F$ is "well-defined".
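The averaging step can be sketched as a thin wrapper around a noisy oracle. The names `averaged_query` and `noisy_oracle` are hypothetical. The wrapper spends `tau` oracle calls per query, which reduces the sub-Gaussian parameter of the noise from $\sigma$ to $\sigma/\sqrt{\tau}$.

```python
import random

def averaged_query(noisy_oracle, x, tau):
    """Average tau independent noisy evaluations at the same point x."""
    return sum(noisy_oracle(x) for _ in range(tau)) / tau
```

For instance, with Gaussian noise of standard deviation 1 and `tau = 10000`, the averaged value deviates from $f(x)$ by roughly $0.01$ in standard deviation.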

Let us make the above discussion more precise by describing three oracles. Oracle $\mathcal{O}'$ draws noise $\epsilon_x$ for each $x \in \mathcal{K}$ prior to optimization. Upon querying $x \in \mathcal{K}$, the oracle deterministically returns $f(x) + \epsilon_x$, even if the same point is queried twice. Given that the optimization method does not query the same point (with probability one), this oracle is equivalent to an oblivious version of oracle $\mathcal{O}$ of the original zeroth-order stochastic optimization problem.

To define $\mathcal{O}_\alpha$, let $\mathcal{N}_\alpha$ be an $\alpha$-net in $\ell_\infty$, which can be taken as a box grid of $\mathcal{K}$. If $\mathcal{K} \subseteq R B_\infty$, the size of the net is at most $(R/\alpha)^n$. The oracle draws $\epsilon_x$ for each element $x \in \mathcal{N}_\alpha$ independently. Upon a query $x' \in \mathcal{K}$, the oracle deterministically returns $f(x) + \epsilon_x$ for the $x \in \mathcal{N}_\alpha$ which is closest to $x'$. Note that $\mathcal{O}_\alpha$ is no more powerful than $\mathcal{O}'$, since the learner only obtains the information on the $\alpha$-net.

Oracle $\mathcal{O}_\alpha^\tau$ is a small modification of $\mathcal{O}_\alpha$. This modification models a repeated query at the same point, as described earlier. Parametrized by $\tau$ (the number of queries at the same point), oracle $\mathcal{O}_\alpha^\tau$ draws random variables $\epsilon_x$ for each $x \in \mathcal{N}_\alpha$, but the sub-Gaussian parameter of $\epsilon_x$ is $\sigma/\sqrt{\tau}$. The optimization algorithm pays for $\tau$ oracle calls upon a single call to $\mathcal{O}_\alpha^\tau$.

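We argued that $\mathcal{O}_\alpha^\tau$ is no more powerful than the original zeroth-order oracle given that the algorithm does not revisit the point. In the rest of the section, we will work with $\mathcal{O}_\alpha^\tau$ as the oracle model. For any $x$, denote the projection to $\mathcal{N}_\alpha$ by $P_{\mathcal{N}_\alpha}(x)$. Define $F: \mathcal{K} \to \mathbb{R}$ as

\[ F(x) = f(P_{\mathcal{N}_\alpha}(x)) + \epsilon_{P_{\mathcal{N}_\alpha}(x)}, \]

where $P_{\mathcal{N}_\alpha}(x)$ is the point of $\mathcal{N}_\alpha$ closest to $x$ in the $\ell_\infty$ sense. Clearly,

\[ |F - f|_\infty \le \max_{x \in \mathcal{N}_\alpha} |\epsilon_x| + \alpha L, \qquad (8) \]

where $L$ is the ($\ell_\infty$) Lipschitz constant. Since $(\epsilon_x)_{x \in \mathcal{N}_\alpha}$ define a finite collection of sub-Gaussian random variables with sub-Gaussian parameter $\sigma/\sqrt{\tau}$, we have that with probability at least $1 - \delta$

\[ \max_{x \in \mathcal{N}_\alpha} |\epsilon_x| \le \frac{\sigma}{\sqrt{\tau}} \sqrt{2n\log(R/\alpha) + 2\log(1/\delta)}. \]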


From now on, we condition on this event, which we call $\mathcal{E}$. To guarantee (1), we set

\[ \frac{\epsilon}{2n} = \frac{\sigma}{\sqrt{\tau}} \sqrt{2n\log(R/\alpha) + 2\log(1/\delta)} = \alpha L, \]

where $\tau$ is the parameter from oracle $\mathcal{O}_\alpha^\tau$. We use the first equality to solve for $\tau$ and the second to solve for $\alpha$:

\[ \tau = \frac{\sigma^2 n^2 \left(8n\log(R/\alpha) + 8\log(1/\delta)\right)}{\epsilon^2} = \frac{\sigma^2 n^2 \left(8n\log(2LRn/\epsilon) + 8\log(1/\delta)\right)}{\epsilon^2} = O^*(n^3/\epsilon^2) \]

and $\alpha = \epsilon/(2Ln)$. Note here that $L$ affects $\tau$ only logarithmically, and, in particular, we could have defined the Lipschitz constant with respect to $\ell_2$. We also observe that the oracle model depends on $\alpha$ and, hence, on the target accuracy $\epsilon$. However, because the dependence on $\alpha$ is only logarithmic, we can take $\alpha$ to be much smaller than $\epsilon$.
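The two equalities above determine $\alpha$ and $\tau$ in closed form. The following sketch (the function name is hypothetical) computes them and makes it easy to verify that both sides of the balance equal $\epsilon/(2n)$.

```python
import math

def oracle_parameters(n, eps, sigma, L, R, delta):
    """Solve eps/(2n) = (sigma/sqrt(tau)) * sqrt(2n log(R/alpha) + 2 log(1/delta)) = alpha * L."""
    alpha = eps / (2.0 * L * n)
    log_term = 2.0 * n * math.log(R / alpha) + 2.0 * math.log(1.0 / delta)
    tau = 4.0 * sigma ** 2 * n ** 2 * log_term / eps ** 2
    return alpha, tau
```

The returned `tau` is a real number; in practice one would round it up to an integer number of repeated queries.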

Together with the $O^*(n^{4.5})$ oracle complexity proved in the previous section for optimizing $F$, the choice of $\tau = O^*(n^3\epsilon^{-2})$ evaluations per time step yields a total oracle complexity of

\[ O^*(n^{7.5}\epsilon^{-2}) \]

for the problem of stochastic convex optimization with zeroth-order information. We observe that a factor of $n$ in oracle complexity comes from the union bound over the exponential-sized discretization of the set. This (somewhat artificial) factor can be reduced or removed under additional assumptions on the noise, such as a draw from a Gaussian process with spatial dependence over $\mathcal{K}$. Alternatively, this factor could be removed completely if we could take a union bound over the polynomial number of points visited by the algorithm. Such an argument, however, appears to be tricky.
7 Optimization of Non-convex Functions with Decreasing Fluctuations


Assume the non-convex function $F(x)$ has the property that the "amount of non-convexity" is decreasing as $x$ gets close to its global minimum $x^*$. If one has some control on the rate of this decrease, it is possible to break the optimization problem into stages, where at each stage one optimizes to the current level of non-convexity, redefines the optimization region to guarantee a smaller amount of "non-convexity," and proceeds to the next stage. We are not aware of optimization methods for such a problem, and it is rather surprising that one may obtain provable guarantees through simulated annealing.

As one example, consider the problem of stochastic zeroth-order optimization where the noise level decreases as one approaches the optimal point. Then, one would expect to obtain a range of oracle complexities between $\log(1/\epsilon)$ and $1/\epsilon^2$ in terms of the rate of the noise decrease.

Let us formalize the above discussion. Suppose there exists a 1-Lipschitz $\alpha$-strongly convex function $f(x)$ with minimum achieved at $x^* \in \mathcal{K}$:

\[ f(x) - f(x^*) \ge \langle \nabla f(x^*), x - x^* \rangle + \frac{\alpha}{2}\|x - x^*\|^2 \ge \frac{\alpha}{2}\|x - x^*\|^2. \]

Define a measure of "non-convexity" of $F$ with respect to $f$ in a ball of radius $r$ around $x^*$:

\[ \Delta(r) := \sup_{x \in B_2^n(x^*, r)} |F(x) - f(x)|. \]

We have in mind the situation where $\Delta(r)$ decreases as $r$ decreases to 0. At the $t$-th stage of the optimization problem, suppose we start with an $\ell_2$ ball $B_2^n(x_{t-1}, 2r_t)$ of radius $2r_t$ with the property

\[ B_2^n(x^*, 3r_t) \supseteq B_2^n(x_{t-1}, 2r_t) \supseteq B_2^n(x^*, r_t). \]

Next, we run the simulated annealing procedure for the approximately log-concave function defined over this ball. After $O^*(n^{4.5})$ queries, we are provided with a point $x_t$ such that in expectation (or with high probability)

\[ f(x_t) - f(x^*) \le Cn \cdot \Delta(3r_t) \]

with some universal constant $C > 0$. Thanks to strong convexity,

\[ \frac{\alpha}{2}\|x_t - x^*\|^2 \le f(x_t) - f(x^*) \le Cn \cdot \Delta(3r_t), \]

which suggests the recursive definition of $r_{t+1}$:

\[ \frac{\alpha}{2Cn} r_{t+1}^2 := \frac{\alpha}{2Cn}\|x_t - x^*\|^2 \le \Delta(3r_t). \]

At stage $t+1$ we restrict the region to be $B_2^n(x_t, 2r_{t+1}) \supseteq B_2^n(x^*, r_{t+1})$ and run the optimization algorithm again with the new parameter of approximate convexity. The recursion formula for the radius from $r_t$ to $r_{t+1}$ satisfies

\[ \frac{\alpha}{2Cn} r_{t+1}^2 \le \Delta(3r_t). \]


The recursion formula yields a fixed point — a “critical radius” 


achieved, with = A(3r*). Let us explore two examples: 


r* where no further improved can be 


Polynomial: $\Delta(r) = c\,r^p$, $0 < p < 2$, 
Logarithmic: $\Delta(r) = c\log(1 + dr)$, 

where $c, d > 0$ are constants. For the polynomial case, the critical radius is 
$r^* = \left(\frac{2 \cdot 3^p\, c\, C n}{\alpha}\right)^{1/(2-p)}$, obtained by solving $\frac{\alpha}{2Cn}(r^*)^2 = c\,(3r^*)^p$, and the 
required number of epochs is at most … if we want to get $r_t = (1+\epsilon)\, r^*$. For the logarithmic 
case, the critical radius is the unique non-zero solution to 

\[ \frac{2cCn}{\alpha}\log(1 + 3dr) = r^2. \]

We conclude that at an $O^*(1)$ multiplicative overhead on the number of oracle calls, we can optimize to 
any level of precision above the fixed point $r^*$ of the non-convexity decay function. 
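
The staged recursion for the polynomial case can be explored numerically. The following is a minimal sketch, assuming the recursion $\frac{\alpha}{2Cn} r_{t+1}^2 \le \Delta(3r_t)$ holds with equality (the worst case); the function names and the constants are illustrative choices, not values from the text. Under this assumption $\log(r_t/r^*)$ shrinks by a factor $p/2$ per stage, so the loop simply counts stages until $r_t \le (1+\epsilon) r^*$.

```python
import math

def critical_radius(c, p, C, n, alpha):
    """Fixed point of (alpha / (2 C n)) r^2 = c (3 r)^p, valid for 0 < p < 2."""
    return (2.0 * 3.0 ** p * c * C * n / alpha) ** (1.0 / (2.0 - p))

def next_radius(r, c, p, C, n, alpha):
    """One stage of the recursion, with the inequality taken as an equality (assumption)."""
    return math.sqrt(2.0 * C * n / alpha * c * (3.0 * r) ** p)

def epochs_until(r0, eps, c, p, C, n, alpha, max_epochs=10_000):
    """Count stages until r_t <= (1 + eps) r*, starting from a radius r0 above r*."""
    r_star = critical_radius(c, p, C, n, alpha)
    r, t = r0, 0
    while r > (1.0 + eps) * r_star and t < max_epochs:
        r = next_radius(r, c, p, C, n, alpha)
        t += 1
    return t, r

# Illustrative constants only (hypothetical, chosen so that r* = 0.6).
params = dict(c=0.01, p=1.0, C=1.0, n=10, alpha=1.0)
r_star = critical_radius(**params)
epochs, r_final = epochs_until(r0=10.0, eps=1e-3, **params)
```

With these constants the radius settles near the critical radius after a small number of stages, in line with the remark that precision below $r^*$ cannot be reached by this scheme.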

8 Further Applications 

In this section, we sketch several applications of the zeroth-order optimization method we introduced. 
Our treatment is cursory, meant only to give a sense of the range of possible domains. 

8.1 Private computation with distributed data 

Suppose $i = 1, \ldots, n$ are entities (say, hospitals) that each possess private data in the form of $m$ covariate-response 
pairs $\{(x_{i,j}, y_{i,j})\}_{j=1}^{m}$. A natural approach to analyzing the aggregate data is to compute a 
minimizer $w^*$ of 

\[ f(w) = \frac{1}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m} \ell(x_{i,j}, y_{i,j}; w) + R(w) \qquad (9) \]

for some convex regularization function $R$ and a convex (in $w$) loss function $\ell$. For instance, 
$\ell(x_{i,j}, y_{i,j}; w) = (y_{i,j} - x_{i,j} \cdot w)^2$ and $R(w) = 0$ would correspond to the problem of linear regression. 

Given that the hospitals are not willing to release the data to a central authority that would perform 
the computation, how can the objective (9) be minimized? We propose to use the simulated annealing 
method of this paper. To this end, we need to specify what happens when the value $f(w_t)$ at the current 
point $w_t$ is requested. Consider the following idea. The current $w_t$ is passed to a randomly chosen 
hospital $I_t \sim \mathrm{unif}(1, \ldots, n)$. The hospital, in turn, privately chooses an index $J_t \sim \mathrm{unif}(1, \ldots, m)$, computes 
the loss $\ell(x_{I_t,J_t}, y_{I_t,J_t}; w_t)$, adds zero-mean noise $\eta_t \sim N(0,1)$, and passes the resulting value 

\[ v_t = \ell(x_{I_t,J_t}, y_{I_t,J_t}; w_t) + \eta_t \]

back to the central authority. Since the computation is done privately by the hospital, the only value 
released to the outside world is the noisy residual. It is easy to check that $v_t$ is an unbiased estimate of 
$f(w_t)$: 

\[ \mathbb{E}[v_t] = f(w_t) \]

with respect to the random variables $(I_t, J_t)$ and $\eta_t$. Moreover, the noise level with respect to each source 
of randomness is of constant order. By repeatedly querying for the noisy value at $w_t$, the algorithm 
can reduce the noise variance, as in Section 6, yet, importantly, the returned value is for a potentially 
different random choice of the hospital and the data point. This latter fact means that repeated querying 
does not allow the central authority to learn a specific data point. Interestingly, the additional layer of 
privacy given by the zero-mean noise $\eta_t$ presents no added difficulty to the minimization procedure, 
except for slightly changing a constant in the number of required queries. 
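
The query protocol above can be sketched in a few lines. This is a minimal illustration, assuming the squared loss of the linear regression example, $R(w) = 0$, and synthetic data; the function names (`make_noisy_oracle`, `averaged_value`, `exact_average_loss`) and all numerical values are hypothetical and only serve to show that averaging repeated queries approaches the aggregate objective.

```python
import random

def squared_loss(x, y, w):
    """Loss (y - x . w)^2 from the linear regression example."""
    prediction = sum(xk * wk for xk, wk in zip(x, w))
    return (y - prediction) ** 2

def make_noisy_oracle(hospital_data, loss, rng, noise_std=1.0):
    """hospital_data[i][j] is the pair (x_ij, y_ij) held privately by hospital i."""
    def query(w):
        i = rng.randrange(len(hospital_data))        # I_t ~ unif(1, ..., n)
        j = rng.randrange(len(hospital_data[i]))     # J_t ~ unif(1, ..., m), chosen privately
        x, y = hospital_data[i][j]
        return loss(x, y, w) + rng.gauss(0.0, noise_std)   # v_t = loss + eta_t
    return query

def averaged_value(query, w, repeats):
    """Repeated querying at the same w reduces the variance of the estimate."""
    return sum(query(w) for _ in range(repeats)) / repeats

def exact_average_loss(hospital_data, loss, w):
    """(1 / mn) * sum of all losses; needs all the data, so it is for checking only."""
    losses = [loss(x, y, w) for hospital in hospital_data for (x, y) in hospital]
    return sum(losses) / len(losses)

# Synthetic data (assumption): n = 3 hospitals with m = 5 points each.
rng = random.Random(0)
true_w = [1.0, -2.0]
hospital_data = []
for _ in range(3):
    points = []
    for _ in range(5):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = sum(a * b for a, b in zip(x, true_w)) + rng.gauss(0.0, 0.1)
        points.append((x, y))
    hospital_data.append(points)

oracle = make_noisy_oracle(hospital_data, squared_loss, rng)
w = [0.5, -0.2]
estimate = averaged_value(oracle, w, repeats=20000)
reference = exact_average_loss(hospital_data, squared_loss, w)
```

Each call to `oracle` reveals a single noisy number for a freshly drawn hospital and data point, which is the only information the central authority receives in this sketch.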

8.2 Two-stage stochastic programming 

Dyer et al. (2013) discuss the following mathematical programming formulation: 

\[ \max \; px + \mathbb{E}\left[\max\{\, qy \mid Wy \le Tx - \xi,\; y \in \mathbb{R}^{n_1} \}\right] \qquad (10) \]
\[ \text{subject to } Ax \le b, \]

where $q \in \mathbb{R}^{n_1}$, and $W$ and $T$ are matrices of conformable dimensions. The expectation is taken over the 
random variable $\xi$. This problem is concave in $x$, and can be solved in two stages. If, given $x$, an 
approximate value for the inner expected maximum can be computed, the problem falls squarely into the 
setting of zeroth order optimization with approximate function evaluations. While the method of Dyer et al. 
(2013) is simpler, its dependence on the target accuracy $\epsilon$ is worse. Additionally, the method of this paper 
can deal directly with constraint sets with non-smooth boundaries; the method can also handle more general 
functions in (10) that are not smooth. 

8.3 Online learning via approximate dynamic programming 

Online learning is a generic name for a set of problems where the forecaster makes repeated predictions 
(or decisions). For concreteness, suppose that on each round $t = 1, \ldots, T$ the forecaster observes some 
side information $s_t \in \mathcal{S}$, makes a prediction $\hat{y}_t \in \mathcal{K}$, and observes an outcome $y_t \in \mathcal{Y}$. The goal of the 
forecaster is to ensure small regret, defined as 

\[ \sum_{t=1}^{T} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(s_t), y_t), \]

where $\mathcal{F}$ is a class of strategies, mapping $\mathcal{S}$ to $\mathcal{K}$, and $\ell : \mathcal{K} \times \mathcal{Y} \to \mathbb{R}$ is a cost function, which we 
assume to be convex in the first argument. The vast majority of online learning methods can be written as 
solutions to the following optimization problem (see Rakhlin et al. 2012): 

\[ \hat{y}_t = \operatorname*{argmin}_{\hat{y} \in \mathcal{K}} \; \max_{y_t \in \mathcal{Y}} \left\{ \ell(\hat{y}, y_t) + \Phi_t(s_1, y_1, \ldots, s_t, y_t) \right\}, \]

where $\Phi_t$ is a relaxation on the minimax optimal value. One of the tightest relaxations is the so-called 
sequential Rademacher complexity, which itself involves an expectation over a sequence of Rademacher 
random variables and a supremum over the class $\mathcal{F}$. While the gradient of $\Phi_t$ might not be available, 
it is often possible to approximately evaluate this function and solve the saddle point problem 
approximately. 

A Proofs of Section 3 

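Proof of Lemma 1. The proof is straightforward: 

\[ g(\alpha x + (1-\alpha) y) \ge e^{-\beta/2}\, h(\alpha x + (1-\alpha) y) \ge e^{-\beta/2}\, h(x)^{\alpha} h(y)^{1-\alpha} \]
\[ \ge e^{-\beta/2} \left(e^{-\beta/2} g(x)\right)^{\alpha} \left(e^{-\beta/2} g(y)\right)^{1-\alpha} \ge e^{-\beta} g(x)^{\alpha} g(y)^{1-\alpha}. \]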


□ 




B Proofs of Section 4.1 


Proof of Lemma 6. Consider a unidimensional $\beta$-log-concave function $g: \mathbb{R} \to \mathbb{R}$. In view of Lemma 2, $g$ 
can be “sandwiched” by a log-concave function $h$ such that $e^{-\beta} h(x) \le g(x) \le h(x)$. 

Given $\ell$, we want to find $p \in \ell$ such that $g(p) \ge e^{-3\beta} \max_{z \in \ell} g(z)$. We use the following 3-point 
method, inspired by Agarwal et al. (2013), to provide such a point. Let us work with the convex function 

\[ \tilde{h} = -\log h \]

and a nearly-convex function 

\[ \tilde{g} = -\log g. \]

The sandwiching guarantee can be written as 

\[ \tilde{h}(x) \le \tilde{g}(x) \le \tilde{h}(x) + \beta. \]

We now claim that each iteration of the “while” loop of Algorithm 3 maintains the following property: 
either the length of the interval is reduced by at least 3/4 while still containing the optimal point, or we 
have the output point $p$ that satisfies 

\[ \tilde{g}(p) \le \min_{x \in \ell} \tilde{g}(x) + 3\beta. \]

In the latter case, $g(p) \ge e^{-3\beta} \max_{z \in \ell} g(z)$ as desired. 

There are essentially two cases. First, if $\tilde{g}(x_l) - \tilde{g}(x_r) \ge \beta$ (or similarly we can argue for 
$|\tilde{g}(x_l) - \tilde{g}(x_c)| \ge \beta$ and $|\tilde{g}(x_r) - \tilde{g}(x_c)| \ge \beta$), we have 

\[ \tilde{h}(x_l) + \beta \ge \tilde{g}(x_l) \ge \tilde{g}(x_r) + \beta \ge \tilde{h}(x_r) + \beta \]

and thus $\tilde{h}(x_l) \ge \tilde{h}(x_r)$. Because of convexity of $\tilde{h}$ we can safely remove $[\underline{x}, x_l]$ with the remaining interval 
still containing the point we are looking for. The second case is when 

\[ |\tilde{g}(x_l) - \tilde{g}(x_r)| < \beta, \quad |\tilde{g}(x_l) - \tilde{g}(x_c)| < \beta, \quad |\tilde{g}(x_r) - \tilde{g}(x_c)| < \beta. \]

Here, we can show the function $\tilde{g}(x)$ is flat enough on $[\underline{x}, \overline{x}]$ and thus the best of $x_l, x_c, x_r$ is good 
enough. It is not hard to see that 

\[ |\tilde{h}(x_l) - \tilde{h}(x_r)| < 2\beta, \quad |\tilde{h}(x_l) - \tilde{h}(x_c)| < 2\beta, \quad |\tilde{h}(x_r) - \tilde{h}(x_c)| < 2\beta. \]

Consider the point $x_l$. By convexity of $\tilde{h}$, there must be a supporting line $k_l(x)$ that is below the convex 
function $\tilde{h}$ and such that $k_l(x_l) = \tilde{h}(x_l)$. Thus 

\[ \min_{[\underline{x}, x_l]} \tilde{h}(x) \ge \min_{[\underline{x}, x_l]} k_l(x) \ge k_l(x_l) - 2\beta = \tilde{h}(x_l) - 2\beta \]

using the fact that $|\underline{x} - x_l| = |x_l - x_c|$. Similarly we can prove 

\[ \min_{[x_l, x_r]} \tilde{h}(x) \ge \tilde{h}(x_c) - 2\beta, \qquad \min_{[x_r, \overline{x}]} \tilde{h}(x) \ge \tilde{h}(x_r) - 2\beta. \]

Thus 

\[ \min_{[\underline{x}, \overline{x}]} \tilde{h}(x) \ge \min\{\tilde{h}(x_l), \tilde{h}(x_c), \tilde{h}(x_r)\} - 2\beta. \]

By sandwiching, 

\[ \min_{[\underline{x}, \overline{x}]} \tilde{g}(x) \ge \min\{\tilde{g}(x_l), \tilde{g}(x_c), \tilde{g}(x_r)\} - 3\beta \]

and, hence, 

\[ \tilde{g}(p) \le \min_{[\underline{x}, \overline{x}]} \tilde{g}(x) + 3\beta. \]

It remains to show that the algorithm will terminate in an $O^*(1)$ number of steps. Let $L$ be the Lipschitz 
constant of $\tilde{h}$. By the time the interval is shrunk to $|\overline{x} - \underline{x}| \le \beta/L$, the algorithm must have entered the 
second case above and terminated. □ 
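
The two cases of this argument translate into a short search routine. The following is a minimal sketch, not the paper's Algorithm 3 (which is not reproduced in this section): it assumes quarter points $x_l, x_c, x_r$, works directly with $\tilde g = -\log g$, and discards a quarter of the interval whenever a gap of at least $\beta$ certifies an ordering of the convex function $\tilde h$. The test function `g_tilde_example` is a hypothetical continuous example, chosen so that the flat case is eventually reached.

```python
import math

def three_point_search(g_tilde, lo, hi, beta, max_iter=200):
    """Return p in [lo, hi] with g_tilde(p) <= min g_tilde + 3 * beta, assuming
    h <= g_tilde <= h + beta for some convex h (the sandwiching guarantee)."""
    for _ in range(max_iter):
        width = hi - lo
        x_l, x_c, x_r = lo + width / 4, lo + width / 2, lo + 3 * width / 4
        g_l, g_c, g_r = g_tilde(x_l), g_tilde(x_c), g_tilde(x_r)
        # First case: a gap of at least beta shows h is larger at the left point
        # of the pair, so a minimizer of h survives when [lo, x_l] is removed.
        if g_l - g_r >= beta or g_l - g_c >= beta or g_c - g_r >= beta:
            lo = x_l
        # Symmetric situation: h is larger at the right point of the pair.
        elif g_r - g_l >= beta or g_r - g_c >= beta or g_c - g_l >= beta:
            hi = x_r
        else:
            # Second case: all three values are within beta of each other.
            return min((g_l, x_l), (g_c, x_c), (g_r, x_r))[1]
    raise RuntimeError("interval did not become flat within max_iter iterations")

beta = 0.05

def g_tilde_example(x):
    """Convex (x - 0.3)^2 plus a bounded wiggle in [0, beta] (illustrative only)."""
    return (x - 0.3) ** 2 + beta * 0.5 * (1.0 + math.sin(50.0 * x))

p = three_point_search(g_tilde_example, 0.0, 1.0, beta)
grid_min = min(g_tilde_example(k / 10000) for k in range(10001))
```

The returned point need not be close to the true minimizer; the guarantee is only on the function value, up to the additive $3\beta$ slack.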

Proof of Lemma 7. Consider a unidimensional $\beta$-log-concave function $g: \mathbb{R} \to \mathbb{R}$. In view of Lemma 2, $g$ 
can be “sandwiched” by a log-concave function $h$ such that $e^{-\beta} h(x) \le g(x) \le h(x)$. 

We consider the interval $[x_l, x_r] = [p, \overline{x}]$ (the other case follows similarly). If $g(\overline{x}) \ge \tfrac{1}{2} e^{-\beta} \epsilon g(p)$, set 
$e_1 = \overline{x}$. Otherwise we have $g(\overline{x}) < \tfrac{1}{2} e^{-\beta} \epsilon g(p)$ and we proceed. The procedure always queries the midpoint $x_m$ of 
the current interval $[x_l, x_r]$. If $g(x_m) > \epsilon g(p)$ set $x_l = x_m$, or if $g(x_m) < \tfrac{1}{2} e^{-\beta} \epsilon g(p)$ set $x_r = x_m$, and continue 
the search. Either operation halves the interval. If the midpoint $x_m$ is such that 
$\tfrac{1}{2} e^{-\beta} \epsilon g(p) \le g(x_m) \le \epsilon g(p)$, stop the process and return $e_1 = x_m$. At every iteration, the interval $[x_l, x_r]$ is 
such that $g(x_l) > \epsilon g(p)$ and $g(x_r) < \tfrac{1}{2} e^{-\beta} \epsilon g(p)$. We now claim that the algorithm must terminate in an 
$O^*(1)$ number of steps. Let $\tilde{h} = -\log h$, and let $L$ be the Lipschitz constant of $\tilde{h}$. As soon as the length of the 
current interval satisfies $|x_l - x_r| < 1/(2L)$, we have $|\tilde{h}(x_l) - \tilde{h}(x_r)| < 1/2$. Thus $h(x_l)/h(x_r) < e^{1/2}$ and 
$g(x_l)/g(x_r) < e^{\beta + 1/2}$, implying that both $g(x_l) > \epsilon g(p)$ and $g(x_r) < \tfrac{1}{2} e^{-\beta} \epsilon g(p)$ cannot be true at the same 
time, as $2e^{\beta} > e^{\beta + 1/2}$. Hence, the algorithm terminates in a number of steps that is logarithmic in $L$. 

□ 
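
The endpoint search described in this proof is a plain bisection. The sketch below is a minimal illustration under the stated stopping band $\tfrac12 e^{-\beta}\epsilon g(p) \le g(e_1) \le \epsilon g(p)$; the function name `find_right_endpoint` and the example function `g_example` are hypothetical, and the example is continuous so that the band is guaranteed to be hit.

```python
import math

def find_right_endpoint(g, p, x_hi, eps, beta, max_iter=200):
    """Bisection on [p, x_hi] for a point e1 with
    0.5 * exp(-beta) * eps * g(p) <= g(e1) <= eps * g(p),
    or e1 = x_hi if g is still above the lower threshold at the boundary."""
    upper = eps * g(p)
    lower = 0.5 * math.exp(-beta) * upper
    if g(x_hi) >= lower:
        return x_hi
    x_l, x_r = p, x_hi            # invariant: g(x_l) > upper and g(x_r) < lower
    for _ in range(max_iter):
        x_m = 0.5 * (x_l + x_r)   # always query the midpoint
        g_m = g(x_m)
        if g_m > upper:
            x_l = x_m
        elif g_m < lower:
            x_r = x_m
        else:
            return x_m            # midpoint falls inside the band
    raise RuntimeError("no admissible endpoint found within max_iter iterations")

beta, eps = 0.2, 0.01

def g_example(x):
    """exp(-|x|) times a factor in [exp(-beta), 1] (illustrative only)."""
    return math.exp(-abs(x) - beta * 0.5 * (1.0 + math.sin(9.0 * x)))

e1 = find_right_endpoint(g_example, 0.0, 20.0, eps, beta)
e1_boundary = find_right_endpoint(g_example, 0.0, 1.0, eps, beta)
```

The left endpoint $e_{-1}$ would be found by the mirrored search on $[\underline{x}, p]$.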


Proof of Lemma 5. Let $h$ be the log-concave function associated with the $\beta$-log-concave function $g$ in 
the sense of Lemma 2, so that $e^{-\beta} h(x) \le g(x) \le h(x)$ for all $x \in \ell$, and let $L_f(t)$ denote the (upper) level set of 
a function $f$ at level $t$. 

We note that since $L_g(t) \subseteq L_h(t)$ and $e^{-\beta} h(x) \le g(x) \le h(x)$, (4) implies that 

\[ L_h(e^{\beta} \epsilon g(p)) \subseteq [e_{-1}, e_1] \subseteq L_h(\tfrac{1}{2} e^{-\beta} \epsilon g(p)). \qquad (11) \]

Moreover, either $e_{-1} = \underline{x}$ or $g(e_{-1}) \le \epsilon g(p)$, which implies $h(e_{-1}) \le e^{\beta} \epsilon g(p) \le \tfrac{1}{2} e^{-\beta} g(p)$ if $\epsilon \le \tfrac{1}{2} e^{-2\beta}$. 

The stationary distribution for this sampling scheme is a truncated distribution according to the 
$\beta$-log-concave distribution $g$ restricted to $[e_{-1}, e_1]$. (Indeed, this corresponds to the classic Accept-Reject 
method to simulate $g$ based on the uniform distribution with constant $M := g(p) e^{3\beta}$, see Robert and 
Casella (2004), page 49.) Therefore 

\[ d_{tv}(\pi_{g,\ell}, \hat{\pi}_{g,\ell}) = d_{tv}\left(\pi_{g,\ell} 1\{\ell \setminus [e_{-1}, e_1]\}, \hat{\pi}_{g,\ell} 1\{\ell \setminus [e_{-1}, e_1]\}\right) + d_{tv}\left(\pi_{g,\ell} 1\{[e_{-1}, e_1]\}, \hat{\pi}_{g,\ell} 1\{[e_{-1}, e_1]\}\right) \]
\[ \le P_{z \sim g}(z \notin [e_{-1}, e_1]) + \frac{P_{z \sim g}(z \notin [e_{-1}, e_1])}{1 - P_{z \sim g}(z \notin [e_{-1}, e_1])}. \]

Next we verify that the truncation error (in the total variation norm) of restricting $g$ to $[e_{-1}, e_1]$ instead 
of $\ell$ is of the desired order. By Lemma 4, which quantifies the tail decay of unidimensional log-concave 
measures, we have 


z~g 


(z t \e-i, eiD < {z € [e-i, eil) < [z € L/z(e^eg(p))j 

• Pz~h [z t Lhie^eMfS^ < ■ P^-h [hiz)< e^eM}^ < ■ 


< 


where we used (11). 

Thus, provided that < 1/2, the total variation distance between the truncated measure ftgj sup¬ 
ported on [e_i,ei] satisfies 

^ 3e ^c. 

In order to bound the number of evaluations we first bound the probability of the event 
$\{g(X) \ge \tfrac{1}{2} e^{-2\beta} g(p)\}$. Indeed, by (11), if $X \sim U([e_{-1}, e_1])$ and $Z \sim U(L_h(\tfrac{1}{2} \epsilon e^{-\beta} g(p)))$, we have 


Px 


g(X)>-e-2^g(p)|>Px 


h{X) > -e Pg(p) 

V ^ 

h{Z) > ^e~^h{p) 


>Pz 

\/o\{Lhile~^hip))) 

yo\{Lhihee~Ph{p))) 


By Lemma 3 with s = ^fi(p) and f = ie ^fi(p), it follows that 


log 

vo\{Lh{le~Ph{p))) le-Ph(p) 


yo\{Lhi\ee ^hip))) log^- 


maxh 


yce Phip) 


log 


log^^ + log2 + /3 

maxh 

h(p) 


log2-t)S 


_^ log 2 

-tlog2-i-logl/c-t-;6 “ log2-t-)6-t-logl/c “ log(2/e)' 


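Then, since $r \sim U([0,1])$ we have 

\[ P\left( r \le e^{-3\beta} \frac{g(x)}{g(p)} \;\middle|\; g(x) \ge \tfrac{1}{2} e^{-2\beta} g(p) \right) \ge \tfrac{1}{2} e^{-5\beta} \]

and thus 

\[ P\left( r \le e^{-3\beta} \frac{g(x)}{g(p)} \right) \ge \frac{e^{-5\beta} \log 2}{2 \log(2/\epsilon)}. \]

Since we have a lower bound on the acceptance probability on each sampling step, the number of 
iterations we need to sample is of the order $\frac{2 \log(2/\epsilon)}{e^{-5\beta} \log 2}$. This quantity is $O(\log(1/\epsilon))$ if $\beta$ is $O(1)$. 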


□ 
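
The one-dimensional sampler analyzed in this proof is an accept-reject scheme with a uniform proposal on $[e_{-1}, e_1]$. The following is a minimal sketch, assuming the envelope constant $M = e^{3\beta} g(p)$ and a point $p$ with $g(p) \ge e^{-3\beta}\max g$; the function name `accept_reject_sample`, the example function `g_example`, and the interval $[-4, 4]$ are hypothetical choices for illustration.

```python
import math
import random

def accept_reject_sample(g, p, e_left, e_right, beta, rng, max_tries=100_000):
    """Draw one sample from the density proportional to g on [e_left, e_right].
    Assumes g(p) >= exp(-3 * beta) * max g, so M = exp(3 * beta) * g(p) dominates g."""
    M = math.exp(3.0 * beta) * g(p)
    for tries in range(1, max_tries + 1):
        x = rng.uniform(e_left, e_right)   # uniform proposal on the truncated interval
        r = rng.random()                   # r ~ U([0, 1])
        if r <= g(x) / M:                  # accept with probability g(x) / M
            return x, tries
    raise RuntimeError("no proposal accepted within max_tries")

beta = 0.1

def g_example(x):
    """Gaussian-shaped function times a factor in [exp(-beta), 1] (illustrative only)."""
    return math.exp(-0.5 * x * x - beta * 0.5 * (1.0 + math.sin(7.0 * x)))

rng = random.Random(0)
draws = [accept_reject_sample(g_example, 0.0, -4.0, 4.0, beta, rng) for _ in range(2000)]
samples = [x for x, _ in draws]
mean_tries = sum(t for _, t in draws) / len(draws)
sample_mean = sum(samples) / len(samples)
```

The average number of proposals per accepted sample is the reciprocal of the acceptance probability, which is the quantity the proof bounds from below.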


C Proofs of Section 4.2 

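Proof of Theorem 3. Define the shorthand $\rho = e^{\beta/2}$. By sandwiching, 

\[ \rho^{-1} h(x) \le g(x) \le \rho\, h(x). \]

Then 

\[ \rho^{-2} \pi_h(x) \le \pi_g(x) \le \rho^{2} \pi_h(x), \]
\[ \rho^{-2} P_u^h(A) \le P_u^g(A) \le \rho^{2} P_u^h(A) \]

for any $x, u \in \mathcal{K}$ and $A \subset \mathcal{K}$. Thus we have 

\[ \phi^g(S) \ge \rho^{-6} \phi^h(S). \]

The $s$-conductance bound can be derived as follows. 

\[ \phi_s^g = \inf_{A \subset \mathcal{K},\; s < \pi_g(A) \le 1/2} \frac{\int_A P_u^g(\mathcal{K} \setminus A)\, d\pi_g}{\pi_g(A) - s} \]
\[ \ge \rho^{-6} \inf_{A \subset \mathcal{K},\; s < \pi_g(A) \le 1/2} \frac{\int_A P_u^h(\mathcal{K} \setminus A)\, d\pi_h}{\pi_h(A) - s/\rho^2} \]
\[ \ge \rho^{-6} \inf_{A \subset \mathcal{K},\; s/\rho^2 < \pi_h(A) \le 1/2} \frac{\int_A P_u^h(\mathcal{K} \setminus A)\, d\pi_h}{\pi_h(A) - s/\rho^2}. \]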

□ 


Proof of Theorem 4. The Hs defined in Theorem 1 can be upper bounded by 

I 


Hs = sup 

A:ng(A)<sJA] dUg 


j 

sJa 


-1 


d 7T p 


fridcrto* .,2 „ 

sup < / —-1 dTig ■ / dn 

A:ng(A)<s dUg ) JA 

u 


1/2 


< sup - 
AiJTg (A)<s 


f da^^^ 


“I 

1/2 


-1 dn 




Let us now use upper bound of Theorem 1 with s = ) ^nd D=^,as well as Theorem 3. We obtain 

dtv[ng,a^"^^)< f + 


2^7 


7 2M 

<- + - 

2 7 


ip 


7 2M 
< - H-exp 

2 7 


‘ m{p 


V 


where p = e^'^. In view of Theorem 2, 




cr 


nRiog^[- 
we arrive at 


, y 2M 

dtv{ng,o^ ') < - + — exp 


m 

~2 


( ^2^ 
p~^cr 

n, 2 P^nRM 


Hence, if 

^ Ap'^MnR 

m>Cn^^ „ log^^ „ 



then 

dtv(?rg,(7'™’) <y. 


log — 

r 


□ 


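Proof of Theorem 5. Step 1. (Main Step) By the triangle inequality we have 

\[ d_{tv}(\hat{\sigma}^{(m)}, \pi_g) \le d_{tv}(\hat{\sigma}^{(m)}, \sigma^{(m)}) + d_{tv}(\sigma^{(m)}, \pi_g). \qquad (12) \]

The last term in (12) converges to zero at a geometric rate in $m$ by Theorem 4. Specifically we have 

\[ d_{tv}(\sigma^{(m)}, \pi_g) \le H_s + \frac{H_s}{s} \left( 1 - \frac{(\phi_s^g)^2}{2} \right)^m. \]

To bound the total variation distance after $m$ steps between the two random walks $\hat{\sigma}^{(m)}$ and $\sigma^{(m)}$ 
from their corresponding starting distributions $\hat{\sigma}^{(0)}$ and $\sigma^{(0)}$, write for any measurable set $A$ 

\[ \hat{\sigma}^{(m)}(A) = \int (\hat{P}_x^g)^{(m)}(A)\, \hat{\sigma}^{(0)}(x)\, dx \quad \text{and} \quad \sigma^{(m)}(A) = \int (P_x^g)^{(m)}(A)\, \sigma^{(0)}(x)\, dx \]

so that 

\[ d_{tv}(\hat{\sigma}^{(m)}, \sigma^{(m)}) \le \sup_{u \in \mathcal{K}} d_{tv}\left((\hat{P}_u^g)^{(m)}, (P_u^g)^{(m)}\right) + 2\, d_{tv}(\hat{\sigma}^{(0)}, \sigma^{(0)}). \]

The result follows from Step 2, which shows 
$\sup_{u \in \mathcal{K}} d_{tv}\left((\hat{P}_u^g)^{(m)}, (P_u^g)^{(m)}\right) \le m \sup_{\ell \subset \mathcal{K}} d_{tv}(\pi_{g,\ell}, \hat{\pi}_{g,\ell})$. 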

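Step 2. (Error Propagation Bound in $m$ Steps) The unidimensional sampling scheme produces a 
sample from a truncated distribution (see Lemma 5). That is, at each step of the Hit-and-Run algorithm, 
we are sampling from a truncated measure, obtained by truncating $g$ (an approximately log-concave 
function in $\mathbb{R}^n$) along each line $\ell$. Let us denote the transition probability kernel starting from 
$u$ for this truncated function by $\hat{P}_u^g$ and the kernel for the original function by $P_u^g$. Let us bound the 
total variation distance between these two kernels through the spherical (elliptical) coordinate system, 
with $p(\cdot)$ being the density corresponding to the measure $P(\cdot)$. 

Suppose that $\sup_{\ell \subset \mathcal{K}} d_{tv}(\pi_{g,\ell}, \hat{\pi}_{g,\ell}) \le \epsilon$. Since $\hat{p}_u(\theta) = p_u(\theta) = p(\theta)$, it holds that 

\[ 2\, d_{tv}(\hat{P}_u^g, P_u^g) = \int\!\!\int \left| \hat{p}_u(r|\theta)\, \hat{p}_u(\theta) - p_u(r|\theta)\, p_u(\theta) \right| dr\, d\theta \]
\[ = \int \left( \int \left| \hat{p}_u(r|\theta) - p_u(r|\theta) \right| dr \right) p(\theta)\, d\theta \]
\[ \le 2\epsilon \int p(\theta)\, d\theta = 2\epsilon, \]

where on each line (over all $\theta$ according to the measure given by the linear transformation composed 
with uniform direction) the truncated distribution is an $\epsilon$ approximation to $\pi_g$. We now claim that the 
$m$-fold iterate of the Hit-and-Run kernel satisfies $d_{tv}\left((\hat{P}_u^g)^{(m)}, (P_u^g)^{(m)}\right) \le m\epsilon$. Let us prove this by induction. 

Suppose it holds for $m - 1$ steps. Then 

\[ 2\, d_{tv}\left((\hat{P}_u^g)^{(m)}, (P_u^g)^{(m)}\right) \le \int \left| \int \left[ (\hat{p}_u)^{(m-1)}(y)\, \hat{p}_y(x) - (p_u)^{(m-1)}(y)\, \hat{p}_y(x) \right] dy \right| dx \]
\[ + \int \left| \int \left[ (p_u)^{(m-1)}(y)\, \hat{p}_y(x) - (p_u)^{(m-1)}(y)\, p_y(x) \right] dy \right| dx \]
\[ \le \int \left| (\hat{p}_u)^{(m-1)}(y) - (p_u)^{(m-1)}(y) \right| \left( \int \hat{p}_y(x)\, dx \right) dy + \int (p_u)^{(m-1)}(y) \left( \int \left| \hat{p}_y(x) - p_y(x) \right| dx \right) dy \]
\[ \le 2\, d_{tv}\left((\hat{P}_u^g)^{(m-1)}, (P_u^g)^{(m-1)}\right) + 2 \max_{y} d_{tv}(\hat{P}_y^g, P_y^g) \]
\[ \le 2(m-1)\epsilon + 2\epsilon = 2m\epsilon. \]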


□ 
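
The linear growth of the error in the number of steps can be checked on a toy example. The sketch below is an illustration only: it replaces the Hit-and-Run kernels by two small finite-state transition matrices (hypothetical values) whose rows differ by at most $\epsilon$ in total variation, and records the distance between the $m$-step distributions started from the same point.

```python
def tv_distance(mu, nu):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

def step(mu, kernel):
    """One step of a Markov chain: mu' = mu * kernel (kernel rows sum to one)."""
    n = len(kernel)
    return [sum(mu[i] * kernel[i][j] for i in range(n)) for j in range(n)]

def max_row_tv(kernel_a, kernel_b):
    """Largest one-step total variation distance over all starting states."""
    return max(tv_distance(ra, rb) for ra, rb in zip(kernel_a, kernel_b))

# Toy kernels standing in for the exact and the truncated one-step kernels.
exact = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
perturbed = [[0.52, 0.29, 0.19], [0.2, 0.58, 0.22], [0.11, 0.3, 0.59]]

eps = max_row_tv(exact, perturbed)
mu = nu = [1.0, 0.0, 0.0]          # both walks start from the same state
gaps = []
for m in range(1, 11):
    mu, nu = step(mu, exact), step(nu, perturbed)
    gaps.append((m, tv_distance(mu, nu)))
```

In such toy examples the observed gap is typically well below $m\epsilon$, since both chains also mix; the induction only uses the worst-case per-step error.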

Proof of Theorem 2. The proof follows closely the arguments in the proof of Theorem 3.7 in Lovász and 
Vempala (2006a) for bounded sets, with modifications to avoid the truncation device discussed in Section 
3.3 of Lovász and Vempala (2006a). Define the step-size $F(x)$ by $P(\|x - y\| \le F(x)) = 1/8$, where $y$ is a 
random step from $x$. Next define $\lambda(x, t) = \mathrm{vol}((x + tB) \cap L(\tfrac{3}{4} f(x)))/\mathrm{vol}(tB)$ and $s(x) = \sup\{t \ge 0 : \lambda(x, t) \ge 63/64\}$. 
Finally, $\alpha(x) = \inf\{t \ge 3 : P(f(y) \ge t f(x)) \le 1/16\}$, where $y$ is a hit-and-run step from $x$. 

Let = Si u S 2 be a partition into measurable sets, where Si = S and p = rihiS-i) < TihiSz)- For for 
D = Rlog(C'n^/p) we will prove that 

[ PxiS2)dx> -^— —TZhiSi). (13) 

Jsi CnD log ^ 

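Consider the points that are deep inside these sets with respect to the one-step distribution, 

\[ S_1' = \{x \in S_1 : P_x(S_2) < 1/1000\} \quad \text{and} \quad S_2' = \{x \in S_2 : P_x(S_1) < 1/1000\}, \]

and the complement $S_3' = \mathcal{K} \setminus (S_1' \cup S_2')$. 

Suppose $\pi_h(S_1') < \pi_h(S_1)/2$. Then 

\[ \int_{S_1} P_x(S_2)\, dx \ge \frac{1}{1000}\, \pi_h(S_1 \setminus S_1') \ge \frac{1}{2000}\, \pi_h(S_1), \]

which proves (13). Thus we can assume 

\[ \pi_h(S_1') \ge \pi_h(S_1)/2 \quad \text{and} \quad \pi_h(S_2') \ge \pi_h(S_2)/2. \qquad (14) \]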

Define the exceptional subset W as set of points u for which a(M) is very large 

VF = VFi u IV 2 , where Wi ;= {u e S: a(u) > 2^^nDIp] and W 2 := {u e JC ; ||x- zi;|| > D}. 
By Lemma 6.10 in Lovasz and Vempala (2007), n^iWi) < pl{2^^nD} and, by Lemma 5.17 in Lovasz and 
Vempala (2007), nh{W 2 ) < Now for any ue S'^WL and ve 

dtviPu,Pv) > PuiSi) - PviSi) = 1 - PuiSi) - PviSi) > 1 - 


from the definition of $S_1'$ and $S_2'$. Thus, by Lemma 6.8 of Lovász and Vempala (2006b), we have 

1 1 1 
dhiu,v )>^———-—^ or |M-y|>-—max{F(u),F(i;)}. 
128log(3 + a(M)) 2^2 log ^ 4vn 

By Lemma 3.2 of Lovász and Vempala (2006b), the latter implies that 


\u-v\> —— max{ 5 (M),s(y)} 

2^y/n 

In either case, by Lemma 3.5 in Lovász and Vempala (2006a), for any point $x \in [u, v]$, we have 


nD 


s(x) < 2 ^^log- \u- v\\/li 

P 

14. _ 

<2 \og — djfr\W2iu,v)D\/n 


where the second inequality follows from $u, v \in \mathcal{K} \setminus W_2$. 

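Recall the original partition $S_1'$, $S_2'$ and $S_3' = \mathcal{K} \setminus \{S_1' \cup S_2'\}$ of $\mathcal{K}$. We will apply Theorem 2.1 of Lovász 
and Vempala (2006b) with a different partition. Consider the partition of $\mathcal{K} \setminus W_2$ defined as $\bar{S}_1 = S_1' \setminus W$, 
$\bar{S}_2 = S_2' \setminus W$ and $\bar{S}_3 = \mathcal{K} \setminus \{W_2 \cup \bar{S}_1 \cup \bar{S}_2\}$. These definitions imply that $\bar{S}_3 \subset S_3' \cup W$ so that 

\[ \pi_h(\bar{S}_3) \le \pi_h(S_3') + \pi_h(W). \qquad (15) \]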


Define for x £ JC \ W 2 


h[x) = 


six) 

2^^DV^\ogf 


It follows that for any u e Sj\W and i; e S^WL and x e [u, v], we have h[x) < dj^\w 2 iu, v)/3, and since 
JC \ W 2 is a convex body, we bave by Theorem 2.1 of Lovasz and Vempala (2006b) that 



hix)dx > 


fji^\W 2 hix]hix)dx 
/jr\W 2 dix)dx 


min 



h[x)dx,J h(x)dx| 


(16) 


Although E; 2 ( 5 (x)) is large, we need a lower bound on six)h[x)dxl h[x)dx. Since 5 (x) can 
be large if ^ is unbounded we modify the standard bound next. Because the level set of measure 1/8 
contains a ball of radius r, we have 


/jfr\W 2 six)hix)dxl h[x)dx 


= fjf\W 2 /o"“ dthix)dxl fj. hix)dx 
~ fo /{xejr\W 2 :A(x,f)> 63 / 64 } hMdxdtl fjr hix)dx 
- h-fo /{xejr\W 2 :A(x,f)> 3 / 4 } hix)dxdt/ h{x)dx 


> — r 

- 16 Jc 

> — r 

- 16 Jc 


(/{xejr:A(x,f)>3/4} h(x)dx- h(x)dxj^ dt/h[x)dx 
dt 


(i 


1 _ 12t^/n _ 

2 r Cn^]^ 


212 VH 
where the first and third inequalities follow from page 998 in Lovász and Vempala (2006b), the second by 
definition of $W_2$, where we take $C$ and $n$ large enough. Therefore, dividing both sides of (16) by $\int_{\mathcal{K} \setminus W_2} h(x)\, dx$, and using 
(15), $\pi_h(S_i') \ge \pi_h(S_i)/2$, $i = 1, 2$, and $p = \pi_h(S_1) \le \pi_h(S_2)$, we have 


UhiS'^) + UhiW) > unm > hx)Hx)dx 

/jr\W2 h{x)dx 

2i6Dv/7Ilog^ ' " 

{mm{nh{S[),7ihiS2]} - TihiW)} 


228 n(D/r) log ^ 


228 n(D/r) log ^ [2 


1 


- min{:T:/j (Si), TT;; (S 2 )} - (IV) 


228 n(D/r) jQg 


«D I' 


f TT/i (Si) 


- p/4t > 


^ h (Si) 


232n(D/r)log^ 


where we used that $\pi_h(W) \le p/4$. Therefore, 


f _ 

Jsi 2000 Cn{D/r)log(nD/p) 


□ 


D Proofs of Section 5 

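Proof of Lemma 8. Define 

\[ Y(a) = \int_{\mathcal{K}} e^{-F(x) a}\, dx. \]

With this notation, we have that 

\[ \left\| \frac{\pi_{g_i}}{\pi_{g_{i+1}}} \right\| = \frac{Y(2/T_i - 1/T_{i+1})\, Y(1/T_{i+1})}{Y(1/T_i)^2}. \]

Define 

\[ G(x, t) = g\left(\frac{x}{t}\right)^{t}. \]

Then it holds that 

\[ G(\lambda (x, t) + (1-\lambda)(x', t')) = g\left( \frac{\lambda x + (1-\lambda) x'}{\lambda t + (1-\lambda) t'} \right)^{\lambda t + (1-\lambda) t'} \]
\[ = g\left( \frac{\lambda t}{\lambda t + (1-\lambda) t'} \cdot \frac{x}{t} + \frac{(1-\lambda) t'}{\lambda t + (1-\lambda) t'} \cdot \frac{x'}{t'} \right)^{\lambda t + (1-\lambda) t'} \]
\[ \ge \exp(-\beta(\lambda t + (1-\lambda) t'))\; g\left(\frac{x}{t}\right)^{\lambda t} g\left(\frac{x'}{t'}\right)^{(1-\lambda) t'} \]
\[ = \exp(-\beta(\lambda t + (1-\lambda) t'))\; G(x, t)^{\lambda}\, G(x', t')^{1-\lambda}. \]

Because 

\[ \int G(x, t)\, dx = \int g\left(\frac{x}{t}\right)^{t} dx = t^n \int g(x)^t\, dx = t^n\, Y(t), \]

through the Prékopa-Leindler inequality, we have 

\[ \left( \frac{a+b}{2} \right)^{2n} Y\left( \frac{a+b}{2} \right)^{2} \ge \exp(-\beta(a+b))\; a^n\, Y(a)\; b^n\, Y(b). \]

Take $a = 2/T_i - 1/T_{i+1}$ and $b = 1/T_{i+1}$. Then we have 

\[ \left\| \frac{\pi_{g_i}}{\pi_{g_{i+1}}} \right\| = \frac{Y(2/T_i - 1/T_{i+1})\, Y(1/T_{i+1})}{Y(1/T_i)^2} \le \left( \frac{1/T_i^2}{(2/T_i - 1/T_{i+1})(1/T_{i+1})} \right)^{n} \exp(2\beta/T_i) \]
\[ \le \left( 1 + \frac{1}{n - 2\sqrt{n}} \right)^{n} \exp(2\beta/T_i) \le e^{n/(n - 2\sqrt{n})} \exp(2\beta/T_i) \le 5 \exp(2\beta/T_i). \]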


□ 
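
The constants in the last chain can be verified numerically. The sketch below assumes the cooling rule $T_{i+1} = T_i (1 - 1/\sqrt{n})$, which is not stated in this section and is used here because it makes the temperature factor equal to $1 + 1/(n - 2\sqrt{n})$; the function names are hypothetical. It also shows from which dimension on the final constant 5 is valid for the exponential bound.

```python
import math

def temperature_ratio_factor(n):
    """(1/T_i^2) / ((2/T_i - 1/T_{i+1}) (1/T_{i+1})), assuming T_{i+1} = T_i (1 - 1/sqrt(n))."""
    delta = 1.0 / math.sqrt(n)
    a = 2.0 - 1.0 / (1.0 - delta)      # a * T_i, positive when n > 4
    b = 1.0 / (1.0 - delta)            # b * T_i
    return 1.0 / (a * b)

def bound_chain(n):
    """Return ((1 + 1/(n - 2 sqrt(n)))^n, exp(n / (n - 2 sqrt(n))))."""
    base = temperature_ratio_factor(n)
    return base ** n, math.exp(n / (n - 2.0 * math.sqrt(n)))

power_28, exp_28 = bound_chain(28)
power_27, exp_27 = bound_chain(27)
```

Under this assumption the exponential bound drops below 5 at $n = 28$ and stays below it for larger $n$, tending to $e$.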


Proof of Theorem 6. Let us prove (6) by induction. Suppose at the end of epoch $i$, we have 


dt^ia^l"\ng.) < 


1 + 


nlogl/p 


r 


We identify $\sigma_{i+1}^{(0)} = \pi_{g_i}$. Hence, by Step 2 in the proof of Theorem 5, we have 

< dtv (o-|7i’ ^g.+i) + + dtv(d'“\, ) 


= dtv(o-|™J,7rg,_n) + mc + 2dtv(d|"’\7rg,.) 

( 


1 + 


1 + 


1 


nlogl/p 

1 

nlogl/p 


r- 

z + 1 


■ + 


1 + 


1 


4nlogl/p nlogl/pj ' 4nlogl/p 


r- 


1 + 


nlogl/p 


r 


by choosing m = ©* (n^) and 


c = 


m 


1 H- 

nlogl/p 


r- 


4nlogl/p' 


Thus at the final epoch, $i = \sqrt{n} \log(1/\rho)$, the error is at most 

\[ \left( 1 + \frac{1}{\sqrt{n} \log(1/\rho)} \right)^{\sqrt{n} \log(1/\rho)} \gamma \le e\gamma. \]


□ 


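Proof of Theorem 7. In Kalai and Vempala (2006), the authors proved the theorem for the case $f(x) = c \cdot x$ 
and claimed it can be extended to arbitrary convex functions. Here we give a proof for the sake of 
completeness. Kalai and Vempala (2006) proved the above inequality for an arbitrary convex set $K$. Let us see 
how to relate an arbitrary convex function $f(x)$ to a linear function by increasing the dimension by 1. 
Consider a convex set $\mathcal{K} \subseteq \mathbb{R}^n$ and a continuous convex function $f: \mathbb{R}^n \to \mathbb{R}$. Consider the epigraph 

\[ \tilde{\mathcal{K}} := \{(x, y) : x \in \mathcal{K},\; y \ge f(x)\}. \]

Define the linear function $\tilde{f}(x, y) = c \cdot (x, y)$, where $(x, y) \in \mathbb{R}^{n+1}$ and $c = (0, 1) \in \mathbb{R}^{n+1}$. We have 
$\mathbb{E} \tilde{f}(X, Y) \ge \mathbb{E} f(X)$, since the first one puts more mass on large values, and 

\[ \min_{(x,y) \in \tilde{\mathcal{K}}} \tilde{f}(x, y) = \min_{x \in \mathcal{K}} f(x). \]

Thus 

\[ \mathbb{E} f(X) - \min_{x \in \mathcal{K}} f(x) \le \mathbb{E} \tilde{f}(X, Y) - \min_{(x,y) \in \tilde{\mathcal{K}}} \tilde{f}(x, y) \le (n+1) T, \]

which completes the proof of the first claim. 

For the second claim, it is not hard to see that 

\[ \mathbb{E}_F f(X) - \min_{x \in \mathcal{K}} f(x) = \mathbb{E}_F\left[ f(X) - \min_{x \in \mathcal{K}} f(x) \right]. \]

Since adding a constant to the function does not have an effect on the density, we can assume without 
loss of generality that $\min_{x \in \mathcal{K}} f(x) = 0$. Thus we have 

\[ \mathbb{E}_F f(X) = \frac{\int_{\mathcal{K}} f(x) \exp\{-F(x)/T\}\, dx}{\int_{\mathcal{K}} \exp\{-F(x)/T\}\, dx} \le \frac{\int_{\mathcal{K}} f(x) \exp\{-f(x)/T\}\, dx \cdot \exp(\rho/T)}{\int_{\mathcal{K}} \exp\{-f(x)/T\}\, dx \cdot \exp(-\rho/T)} \]
\[ \le \exp\{2\rho/T\} \cdot \mathbb{E}_f f(X) \le (n+1) T \cdot \exp(2\rho/T) \]

and 

\[ \mathbb{E}_F f(X) - \min_{x \in \mathcal{K}} f(x) \le (n+1) T \cdot \exp(2\rho/T). \]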

xej?f 


□ 
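
Both claims can be checked on a one-dimensional toy problem. The following is a minimal sketch, assuming $n = 1$, $\mathcal{K} = [-1, 1]$, $f(x) = |x|$ and a perturbation $F$ with $|F - f| \le \rho$; the grid-based integration, the helper name `gibbs_expectation`, and the numerical values are illustrative assumptions.

```python
import math

def gibbs_expectation(values, energy, T, grid):
    """E[values(X)] for X with density proportional to exp(-energy(x) / T) on a uniform grid."""
    weights = [math.exp(-energy(x) / T) for x in grid]
    return sum(values(x) * w for x, w in zip(grid, weights)) / sum(weights)

rho, T = 0.02, 0.1

def f(x):
    """Convex function with minimum value 0 at x = 0."""
    return abs(x)

def F(x):
    """Approximately convex function with |F - f| <= rho."""
    return abs(x) + rho * math.sin(40.0 * x)

grid = [-1.0 + 2.0 * k / 4000 for k in range(4001)]
E_f = gibbs_expectation(f, f, T, grid)   # expectation under the density from f
E_F = gibbs_expectation(f, F, T, grid)   # expectation under the density from F
```

Here `E_f` stays below $(n+1)T$ and `E_F` stays below $e^{2\rho/T}$ times `E_f`, matching the two inequalities of the proof in this toy setting.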


Proof of Corollary 1. We choose $\rho = \epsilon/n$ and $T_K = \rho$. Given the final temperature, $K = \sqrt{n} \log(n/\epsilon)$. The 
optimization guarantee follows from Theorem 7. The number of queries is $O^*(n^3)$ for one sample in 
one phase (Theorem 4) times $O^*(n)$ samples per phase for rounding (Section 5.2) times $K = O^*(\sqrt{n})$ 
phases, for a total of $O^*(n^{4.5})$. The resulting distribution, however, is only $e\gamma$-close to the distribution with 
density proportional to $\exp\{-nF/\epsilon\}$ (by Theorem 6). The guarantee of Theorem 7 holds for the latter 
distribution, and we need to upper bound the effect of having a sample from an almost-desired distribution. 
Thankfully, $\gamma$ enters logarithmically in oracle complexity. Since $f$ is $L$-Lipschitz and the domain is bounded, 
the range of function values over $\mathcal{K}$ is bounded by $B = O(nLR)$. Then $\gamma$ can be chosen as $\epsilon/B$, which again only impacts oracle 
complexity by terms logarithmic in n, L, R. □ 
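
The phase count in this accounting can be reproduced with a few lines. This is a rough sketch, assuming the cooling rule $T_{i+1} = T_i(1 - 1/\sqrt{n})$ and a hypothetical initial temperature $T_0 = 1$ (neither is stated in this proof); polylogarithmic factors hidden in $O^*$ are ignored, so `budget` only tracks the polynomial part $n^3 \cdot n \cdot K$.

```python
import math

def num_phases(n, T0, T_final):
    """Smallest K with T0 * (1 - 1/sqrt(n))**K <= T_final."""
    shrink = 1.0 - 1.0 / math.sqrt(n)
    return max(0, math.ceil(math.log(T_final / T0) / math.log(shrink)))

n, eps = 100, 0.01
T0 = 1.0                                  # hypothetical starting temperature
K = num_phases(n, T0, T_final=eps / n)    # final temperature T_K = eps / n
budget = n ** 3 * n * K                   # per-sample cost * samples per phase * phases
phase_bound = math.sqrt(n) * math.log(n / eps)
```

For these values the phase count stays below $\sqrt{n}\log(n/\epsilon)$, consistent with the $K = O^*(\sqrt{n})$ term in the total.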